Raid-to-Silo Transfer Strategy

From GlueXWiki
Revision as of 13:28, 25 October 2013 by Wolin (Talk | contribs)


Below is a proposal for a raid-to-silo transfer strategy for moving Hall D data files from our local raid server to the JLab tape storage facility. We will update this as our ideas develop.

Elliott Wolin
Dave Lawrence
24-Oct-2013


Notes

  • We will use the jmirror facility from the Computer Center to transfer the files.
  • jmirror deletes the link to the file when the transfer is complete. It does not delete directories, only files.
  • jmirror is fairly smart and reliable. It only deletes the hard link when the file is safely transferred.
  • CRON jobs will delete unneeded dirs after their contents are safely transferred.
  • jmirror is run periodically via a CRON job; it is not a transfer server. Each time it runs, it transfers the files it finds.
  • jmirror will not transfer files actively being written to, nor transfer files twice if invoked twice.
  • Additional hard links to the data file are untouched by jmirror. These can be used to keep the file on disk after transfer.
  • If files are kept, they will be deleted "just-in-time" to make room for new DAQ files. This will require a cleanup strategy and cron scripts to implement it.
  • The DAQ creates a 10 GB file every 30 s, i.e. about 333 MB/s or 1.2 TB/hour. Thus a two-hour run generates roughly 2.4 TB.
  • Files will be queued up for transfer at the end of the run via run control scripts run under the hdops account.
  • Mark I prefers to store files by "run period" with a simple naming scheme (RunPeriod001, RunPeriod002 or similar).
  • Run periods are just date ranges. Run numbers will NOT be reused, i.e. all run numbers are unique across all run periods.
  • Due to constraints in the mss a second level of directories is needed. Mark and I propose simply organizing files by run, e.g. something like Run000001, Run000002, etc.
  • Run file names will include the run number, e.g. something like Run000001.evio.001, Run000001.evio.002, etc.
  • A two-hour run will generate around 250 files.
  • The RAID system stripes data across all disks, independent of logical partitioning.
  • RAID disk partitions do not seem to be needed (see below). They can be implemented later if necessary.
  • Hard links cannot span partitions: ln fails across partitions, and mv falls back to physically copying the file. Putting a file on a different partition therefore always costs a full copy.
  • The RAID server must simultaneously read and write at 300 MB/s, so it is best to avoid additional file copying.
  • We have two completely independent RAID servers, 75 TB each.
  • All CRON jobs will run under the hdsys account.
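
The hard-link behaviour described in the notes above can be illustrated with a minimal Python sketch. It does not call jmirror; an ordinary unlink stands in for jmirror removing its link after a successful transfer. The directory names "staging" and "keep" and the function name are hypothetical, chosen only for this illustration.

```python
import os
import tempfile


def simulate_transfer_with_keep_link(workdir):
    """Show that removing one hard link leaves the data reachable via another."""
    staging = os.path.join(workdir, "staging")  # hypothetical directory name
    keep = os.path.join(workdir, "keep")        # hypothetical directory name
    os.makedirs(staging)
    os.makedirs(keep)

    # A small stand-in for a DAQ file in the staging area.
    staged = os.path.join(staging, "Run000001.evio.001")
    with open(staged, "wb") as f:
        f.write(b"event data")

    # Second hard link to the same data; both must be on the same partition.
    kept = os.path.join(keep, "Run000001.evio.001")
    os.link(staged, kept)
    links_before = os.stat(kept).st_nlink

    # Stand-in for jmirror deleting its link once the file is safely on tape.
    os.unlink(staged)
    links_after = os.stat(kept).st_nlink

    with open(kept, "rb") as f:
        data = f.read()
    return links_before, links_after, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_transfer_with_keep_link(tmp))
```

The link count drops from 2 to 1 when the staging link is removed, and the file contents remain readable through the second link. No data is copied at any point, which matters given the read/write load on the RAID server.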



Notes for Dec 2013 Online Data Challenge

  • We plan to use a basic automated file transfer mechanism in Dec that deletes files on transfer. If someone has the time, we'll try just-in-time deletion.
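
One possible shape for just-in-time deletion is sketched below in Python. This is only an assumption about how such a cleanup could work, not a decided design: it deletes the oldest kept files first, stops as soon as enough space is free, and skips any file whose link count is still above 1, on the assumption that a remaining staging link means jmirror has not yet finished with it. The function name, the keep directory, and the injectable free-space function are all hypothetical.

```python
import os
import shutil


def delete_oldest_until_free(keep_dir, required_free_bytes, get_free_bytes=None):
    """Delete the oldest kept files until at least required_free_bytes is free.

    keep_dir is a hypothetical directory holding the extra hard links.
    get_free_bytes can be replaced for testing; by default it asks the
    filesystem that holds keep_dir.
    """
    if get_free_bytes is None:
        def get_free_bytes():
            return shutil.disk_usage(keep_dir).free

    # Collect (modification time, path) for every regular file under keep_dir.
    candidates = []
    for root, _dirs, files in os.walk(keep_dir):
        for name in files:
            path = os.path.join(root, name)
            candidates.append((os.stat(path).st_mtime, path))
    candidates.sort()  # oldest first

    deleted = []
    for _mtime, path in candidates:
        if get_free_bytes() >= required_free_bytes:
            break
        # Assumption: more than one link means the staging link still exists,
        # i.e. the file has not yet been safely transferred. Leave it alone.
        if os.stat(path).st_nlink > 1:
            continue
        os.unlink(path)
        deleted.append(path)
    return deleted
```

A cron job could call this with the space needed for the next run, for example a few TB, shortly before the DAQ starts writing.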


Proposal

  • Use one RAID server with one DAQ partition. The second RAID server is a hot spare, to be used if the first one dies or if we lose connection to the CC and we need to store data locally.
  • The mechanism for switching to the spare RAID server is still to be determined. To start, it can be manual; eventually it will be automated.
  • The ER will write data to a run-dependent active directory on the RAID server. The ER runs in the hdops account.
  • An end run script will move all the files from that run to a separate run-dependent staging directory on the RAID server.
  • The script will also eventually perform run bookkeeping tasks, and create additional hard links if/when we implement just-in-time deletion.
  • The jmirror cron job will be run every 5-10 minutes from the hdsys account.
  • The jmirror cron job simply transfers all files in the staging directory area to the tape storage facility. Once a file is transferred, jmirror deletes its hard link in the staging area.
  • A cron job will delete empty directories in the staging area, since an empty directory means all of its files have been transferred.
  • A cron job will periodically check for files in active directories from previous runs that never got moved to the staging area. This can happen if the ER crashes or a run ends badly.
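
The end run script and the empty-directory cleanup proposed above might look roughly like the following Python sketch. The root directory arguments, the optional keep area, and the function names are hypothetical. The Run%06d directory naming follows the example in the notes. The sketch assumes the active, staging and keep areas all live on the same partition, so that rename and link are metadata operations and no data is copied.

```python
import os


def run_dir_name(run_number):
    """Directory name for a run, e.g. 1 -> 'Run000001'."""
    return "Run%06d" % run_number


def stage_run(active_root, staging_root, run_number, keep_root=None):
    """Move all files of one run from its active directory to its staging directory.

    If keep_root is given, an extra hard link is created there so the file
    stays on disk after the staging link is removed by the transfer.
    """
    run_name = run_dir_name(run_number)
    active_dir = os.path.join(active_root, run_name)
    staging_dir = os.path.join(staging_root, run_name)
    os.makedirs(staging_dir, exist_ok=True)

    moved = []
    for name in sorted(os.listdir(active_dir)):
        src = os.path.join(active_dir, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(staging_dir, name)
        os.rename(src, dst)  # same partition assumed: no data is copied
        if keep_root is not None:
            keep_dir = os.path.join(keep_root, run_name)
            os.makedirs(keep_dir, exist_ok=True)
            os.link(dst, os.path.join(keep_dir, name))
        moved.append(dst)
    return moved


def remove_empty_run_dirs(staging_root):
    """Remove run directories in the staging area that contain nothing."""
    removed = []
    for name in sorted(os.listdir(staging_root)):
        path = os.path.join(staging_root, name)
        if os.path.isdir(path) and not os.listdir(path):
            os.rmdir(path)  # rmdir refuses to remove a non-empty directory
            removed.append(path)
    return removed
```

One design point to watch: if the cleanup job runs between the creation of a staging directory and the first rename into it, the directory disappears and the rename fails. A real script would need to retry, or the cleanup would need to ignore very recently created directories.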


Tasks, Assignments and Schedule

  1. Install RAID system - Paul - by 30-Oct
  2. Install and test jmirror software and certificates - Paul and Chris L - by 1-Nov
  3. Set up active and staging directories on RAID server - Elliott - by 1-Nov
  4. Test transfer scheme using production directory strategy - Elliott and Paul - by 5-Nov
  5. Write and test jmirror CRON script - Paul and Elliott - by 8-Nov
  6. Write and test end run script - Elliott and Dave L - by 8-Nov
  7. Write and test cleanup CRON scripts - Elliott and Paul - by 13-Nov
  8. Full high-speed system test - Elliott and Dave L - before and during Dec 2013 Online Data Challenge