Raid-to-Silo Transfer Strategy


Below is a proposal for a raid-to-silo transfer strategy for moving Hall D data files from our local raid server to the JLab tape storage facility. We will update this as our ideas develop.

Elliott Wolin
Dave Lawrence
29-Oct-2013


Notes

  • We will use the jmirror facility from the Computer Center to transfer the files.
  • jmirror deletes the link to the file when the transfer is complete. It does not delete directories, only files.
  • jmirror is fairly smart and reliable. It only deletes the hard link when the file is safely transferred.
  • CRON jobs will delete unneeded dirs after their contents are safely transferred.
  • jmirror is run periodically via a CRON job; it is not a transfer server system. It transfers whatever files it finds when it is run.
  • jmirror will not transfer files actively being written to, nor transfer files twice if invoked twice.
  • You cannot have two instances of jmirror running at the same time; they might clash over files that are partially copied.
  • Additional hard links to the data file are untouched by jmirror. These can be used to keep the file on disk after transfer (see the end run script sketch in the Proposal section below).
  • If files are kept they will be deleted "just-in-time" to make room for new DAQ files. This will require a cleanup strategy and cron scripts to implement it.
  • The DAQ creates a 10 GB file every 30 secs, about 1 TB/hour. Thus a two hour run generates 2 TB.
  • Files will be queued up for transfer at the end of the run via run control scripts run under the hdops account. Files will be owned by hdops on the raid system.
  • Files in the tape storage facility will be owned by the gluex account.
  • Mark I. prefers to store files by "run period" with a simple naming scheme (RunPeriod001, RunPeriod002 or similar).
  • Run periods are just date ranges. Run numbers will NOT be reused, i.e. all run numbers are unique across all run periods.
  • Due to constraints in the mss, a second level of directories is needed. Mark and I propose simply organizing files by run, e.g. something like Run000001, Run000002, etc. (see the example layout after this list).
  • Run files will have the run number in them, e.g. something like: Run000001.evio.001, Run000001.evio.002, etc.
  • A two-hour run will generate around 250 files.
  • The RAID system stripes data across all disks, independent of logical partitioning.
  • RAID disk partitions do not seem to be needed (see below); they can be implemented later if necessary.
  • mv and ln cannot create hard links across partitions; files have to be physically copied to put them on a different partition.
  • The raid server must simultaneously read and write at 300 MB/s, so it is best to avoid additional file copying.
  • We have two completely independent RAID servers, 75 TB each.
  • CRON jobs will run under the hdsys or root account as appropriate (not out of the hdops account).
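
For concreteness, the run-period and run-number organization described above would give a tape library layout something like the following. The run period name and run numbers are illustrative only; the top-level /mss/halld area already exists.

 /mss/halld/RunPeriod001/Run000001/Run000001.evio.001
 /mss/halld/RunPeriod001/Run000001/Run000001.evio.002
 ...
 /mss/halld/RunPeriod001/Run000002/Run000002.evio.001
 ...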



Notes for Dec 2013 Online Data Challenge

  • We plan to use a basic automated file transfer mechanism in Dec that deletes files on transfer. If someone has the time we will try just-in-time deletion.


Proposal

  • Use one RAID server with one DAQ partition. The second RAID server is a hot spare, to be used if the first one dies or if we lose connection to the CC and we need to store data locally.
  • The mechanism for switching to the spare RAID server is to be determined. To start it can be manual; eventually it will be automated.
  • The ER will write data to a run-dependent active directory on the RAID server. The ER runs in the hdops account.
  • An end run script will move all files from that run to a separate run-dependent staging directory on the RAID server (a sketch is given after this list).
  • The script will also eventually perform run bookkeeping tasks, and create additional hard links if/when we implement just-in-time deletion.
  • The jmirror cron job will be run every 5-10 minutes from the root account. It will use a file lock to ensure only one copy runs at a time (see the wrapper sketch after this list).
  • The jmirror cron job will transfer all files in the staging area to the tape storage facility. After transfer, the hard links in the staging area will be deleted.
  • The jmirror cron job will also delete empty directories in the staging area, since an empty directory means all of its files have been transferred.
  • Another cron job will periodically check for files in active directories from previous runs that never got moved to the staging area (a minimal check is sketched below). This can happen if the ER crashes or a run ends badly.
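
Below is a minimal sketch of the end run script, assuming hypothetical /raid/active, /raid/staging and /raid/cache directory names; the real paths, the exact file naming and the bookkeeping steps are still to be decided.

 #!/bin/bash
 # End run script sketch (run under the hdops account). Moves a finished run from
 # the active area to the staging area so the jmirror cron job will pick it up.
 # All paths are placeholders for illustration only.

 RUN=$1                                  # run number, e.g. 000001
 ACTIVE=/raid/active/Run${RUN}
 STAGING=/raid/staging/Run${RUN}
 CACHE=/raid/cache/Run${RUN}

 mkdir -p ${STAGING}

 # a mv within one partition only relinks the files; no data is physically copied
 mv ${ACTIVE}/Run${RUN}.evio.* ${STAGING}/

 # if/when just-in-time deletion is implemented, keep an extra hard link per file
 # so the data stays on disk after jmirror deletes the staging links:
 #   mkdir -p ${CACHE}
 #   for f in ${STAGING}/Run${RUN}.evio.*; do ln "$f" ${CACHE}/; done

 # (eventually) run bookkeeping tasks go here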
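
A sketch of the jmirror cron wrapper follows, using flock so that only one instance can run at a time, and removing empty run directories afterwards. The jmirror arguments shown are placeholders only; the actual invocation depends on how the Computer Center sets up the tool and certificates.

 #!/bin/bash
 # jmirror cron wrapper sketch (root crontab, every 5-10 minutes).
 # The flock guard ensures only one instance runs at a time; the jmirror
 # arguments below are placeholders, not the real command line.

 STAGING=/raid/staging
 LOCK=/var/lock/halld-jmirror.lock

 (
   flock -n 9 || exit 0                  # previous instance still running; quit quietly

   # transfer everything currently staged; jmirror deletes each staging hard link
   # only after the file is safely on tape
   jmirror ${STAGING} /mss/halld/RunPeriodXXX    # placeholder arguments

   # remove run directories that are now empty (all of their files transferred)
   find ${STAGING} -mindepth 1 -type d -empty -delete
 ) 9>${LOCK}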
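
Finally, a sketch of the separate check for files stranded in old active directories (again with placeholder paths). It only reports candidates, since deciding what to do with them probably needs an operator.

 #!/bin/bash
 # Report files in the active area untouched for more than a day; these likely
 # belong to runs that crashed or ended badly and were never moved to staging.
 find /raid/active -type f -name '*.evio.*' -mtime +1 -print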


Tasks, Assignments and Schedule

  1. Install RAID system - Paul - by 30-Oct - done 29-Oct-2013
  2. Install and test jmirror software and certificates - Paul and Chris L - by 1-Nov - done 29-Oct-2013
  3. Set up active and staging directories on RAID server - Elliott - by 1-Nov - done 29-Oct-2013
  4. Test transfer scheme - Paul - by 5-Nov - done 29-Oct-2013
  5. Write and test jmirror script - Paul and Elliott - by 8-Nov - done 29-Oct-2013
  6. Implement and test jmirror CRON job - Paul and Elliott - by 8-Nov - done 30-Oct-2013
  7. Write and test end run script - Elliott and Dave L - by 8-Nov
  8. Write and test cleanup CRON scripts - Elliott and Paul/Dave L - by 13-Nov
  9. Full high-speed system test - Elliott and Dave L - before and during Dec 2013 Online Data Challenge



Here is an e-mail exchange Elliott forwarded to me in September 2013 regarding the strategy for writing to the tape library.




-------- Original Message --------
Subject:	Re: Tape write speeds
Date:	Wed, 4 Sep 2013 14:18:05 -0400 (EDT)
From:	Sandy Philpott <philpott@jlab.org>
To:	Elliott Wolin <wolin@jlab.org>, Paul Letta <letta@jlab.org>, "scicomp@jlab.org" <scicomp@jlab.org>

One note, when it's installed - rather than using the Counting House staging fileserver mssstg.jlab.org
(Hall A/B/C's "mass storage system staging" node sfs61), Hall D will have their own staging fileserver to
write raw data to, for copying raw data directly to tape by the hall data writing tool. Then the mssstg
node can serve as a backup.

From: "Christopher Larrieu" <larrieu@jlab.org>
To: "Kurt Strosahl" <strosahl@jlab.org>
Cc: "Sandy Phillpot" <philpott@jlab.org>, scicomp@jlab.org, "Elliott Wolin" <wolin@jlab.org>, "letta@jlab.org Letta" <letta@jlab.org>
Sent: Tuesday, September 3, 2013 9:20:41 AM
Subject: Re: Tape write speeds

Kurt and Elliot,

Please consider the following in benchmarking hall d to-tape data rates:

You will need to produce an appropriate quantity of data over a large enough time span to induce
steady-state behavior. Hall data is treated differently from user data, so you need to use the appropriate
process to shuttle data to tape. Though file size is largely irrelevant, you should nonetheless consider
that larger files are more cumbersome to deal with (for example, if some glitch interrupts transfer to
tape, the quantity of data that would need to be re-transmitted is potentially much greater).  On the
other hand, small files would necessarily require greater overhead in book-keeping and management. 
I would think something like 10G would be a good size to aim for given your data rates.

So my suggestion is the following:

Produce simulated data files (possibly the same one over and over again) of the size and at the rate you anticipate.
Place these files in sfs61:/export/stage/halld/whatever so they can be handled by the hall data writing tool.
Conduct the test without interruption for at least several days.
You will need to create the halld directory and also add a crontab entry to launch jmigrate. You can use the
entry for hallb as a template. (I can't do this because I can't figure out how to sudo on this machine.)

Chris

On Sep 1, 2013, at 11:17 PM, Christopher Larrieu <larrieu@jlab.org> wrote:

Please have Elliot come talk to me.  If he wants to benchmark writing to tape he should do it in a way that
does not go through the fairies.

Sent from my iPhone

On Aug 30, 2013, at 15:58, Kurt Strosahl <strosahl@jlab.org> wrote:

Chris,

 That user (Elliot Wolin) who was asking me about optimum write sizes for LTO6 tapes also asked me about
some writes hall D was doing... He said that he was getting some very slow writes, and when I dug into it I
found the below, can you think of a reason why there was a drop in write speed for some of those files?

scdm7 ~> grep hdops mover.log.2013-08-27
[2013-08-27 14:45:47] [jputter:72512043] INFO  JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.007) via user (hdops) proxy subprocess.
[2013-08-27 14:45:47] [jputter:72512042] INFO  JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.008) via user (hdops) proxy subprocess.
[2013-08-27 14:49:57] [jputter:72512043] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.007 10,000,671,248 bytes in 51.946 seconds (183.602 MiB/sec)
[2013-08-27 14:49:57] [jputter:72512044] INFO  JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.006) via user (hdops) proxy subprocess.
[2013-08-27 14:50:50] [jputter:72512042] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.008 10,000,174,392 bytes in 46.084 seconds (206.946 MiB/sec)
[2013-08-27 14:51:27] [jputter:72512041] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.009 10,000,991,092 bytes in 32.744 seconds (291.280 MiB/sec)
[2013-08-27 14:52:46] [jputter:72512045] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.002 10,001,220,200 bytes in 74.892 seconds (127.355 MiB/sec)
[2013-08-27 14:53:40] [jputter:72512044] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.006 10,000,160,936 bytes in 51.554 seconds (184.988 MiB/sec)
[2013-08-27 14:53:40] [jputter:72512049] INFO  JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.000) via user (hdops) proxy subprocess.
[2013-08-27 14:53:40] [jputter:72512050] INFO  JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.003) via user (hdops) proxy subprocess.
[2013-08-27 14:56:34] [jputter:72512046] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.005 10,000,174,576 bytes in 169.538 seconds (56.252 MiB/sec)
[2013-08-27 14:59:07] [jputter:72512047] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.001 10,000,190,452 bytes in 149.692 seconds (63.710 MiB/sec)
[2013-08-27 15:00:14] [jputter:72512049] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.000 10,000,188,212 bytes in 52.920 seconds (180.214 MiB/sec)
[2013-08-27 15:01:02] [jputter:72512050] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.003 10,000,246,016 bytes in 43.402 seconds (219.736 MiB/sec)
[2013-08-27 15:01:34] [jputter:72512048] INFO  ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.004 10,000,216,984 bytes in 30.424 seconds (313.468 MiB/sec)

Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility


--
Christopher Larrieu
Computer Scientist
High Performance and Scientific Computing
Thomas Jefferson National Accelerator Facility





