Raid-to-Silo Transfer Strategy

The basics
Getting data to the tape library
Files are copied to the tape library in the Computer Center (bottom floor of F-wing) via a multi-stage process. The details are as follows:
- The DAQ system will write data to one of the 4 partitions on gluonraid3
- The partition being written to is changed every run by the script $DAQ_HOME/scripts/run_prestart_sync
- It does this by running /gluex/builds/devel/packages/raidUtils/hd_rotate_raid_links.py, which updates the following links (see the sketch just below):
 /gluex/data/rawdata/prev <- Link to partition where the previous run was written
 /gluex/data/rawdata/curr <- Link to partition where the current (most recent) run is (was) written
 /gluex/data/rawdata/next <- Link to partition where the next run will be written
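The listing below is a minimal sketch of how such a rotation could work; it is not the actual hd_rotate_raid_links.py, and the partition mount points and cycling order are assumptions for illustration.

 #!/usr/bin/env python
 # Hypothetical sketch of the prev/curr/next link rotation (not the real hd_rotate_raid_links.py).
 # Assumes 4 data partitions on gluonraid3 and that the links live in /gluex/data/rawdata.
 import os

 PARTITIONS = ['/gluonraid3/data%d/rawdata' % i for i in range(1, 5)]  # assumed mount points
 LINKDIR = '/gluex/data/rawdata'

 def repoint(linkname, target):
     """Replace the symbolic link LINKDIR/linkname so that it points at target."""
     link = os.path.join(LINKDIR, linkname)
     if os.path.islink(link):
         os.remove(link)
     os.symlink(target, link)

 def rotate():
     # What "next" pointed to becomes "curr", the old "curr" becomes "prev",
     # and "next" advances to the following partition in the cycle.
     new_curr = os.path.realpath(os.path.join(LINKDIR, 'next'))
     new_prev = os.path.realpath(os.path.join(LINKDIR, 'curr'))
     idx = PARTITIONS.index(new_curr) if new_curr in PARTITIONS else -1
     repoint('prev', new_prev)
     repoint('curr', new_curr)
     repoint('next', PARTITIONS[(idx + 1) % len(PARTITIONS)])

 if __name__ == '__main__':
     rotate()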
- Each partition has 3 directory trees used to maintain the data in various stages as it is copied to tape and kept for potential offline analysis within the gluon cluster
 /gluonraid3/dataX/rawdata/active <- directory tree data is written to by the DAQ
 /gluonraid3/dataX/rawdata/volatile <- directory tree data is moved to for later analysis on the gluons
 /gluonraid3/dataX/rawdata/staging <- directory tree with files hard linked to volatile for copying to tape
- A series of cron jobs on gluonraid3 in the hdsys account moves and links the data among these directories.
- These cron jobs are based on 4 scripts:
 /gluex/builds/devel/packages/raidUtils/hd_stage_to_tape.py
 /gluex/builds/devel/packages/raidUtils/hd_link_rundirs.py
 /gluex/builds/devel/packages/raidUtils/hd_disk_map_and_free.py
 /gluex/builds/devel/packages/raidUtils/hd_copy_sample.py
- For current details of how these work, refer to the scripts themselves, which have extensive comments at the top describing what they do. Here is an overview:
hd_stage_to_tape.py:
This script will search for completed runs in the "active" directories of all partitions. For any it finds it will (see the sketch after this list):
- Move the run directory to "volatile"
- Create a tar file of any subdirectories (e.g. RunLog, monitoring, ...)
- Make a hard link to every evio file and tar file in the "staging" directory
- Run the jmigrate program (provided by Scientific Computing Group) to copy files from "staging" to tape
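Below is a simplified, hypothetical sketch of those four steps. It is not the real hd_stage_to_tape.py; the directory layout follows the description above, while the file-name patterns and the jmigrate invocation are assumptions.

 #!/usr/bin/env python
 # Hypothetical sketch of the four steps described above (not the real hd_stage_to_tape.py).
 import os, glob, shutil, subprocess, tarfile

 def stage_run(active_rundir, volatile_dir, staging_dir):
     run = os.path.basename(active_rundir)                 # e.g. Run012345

     # 1. Move the run directory from "active" to "volatile"
     volatile_rundir = os.path.join(volatile_dir, run)
     shutil.move(active_rundir, volatile_rundir)

     # 2. Create a tar file of any subdirectories (RunLog, monitoring, ...)
     for sub in os.listdir(volatile_rundir):
         subpath = os.path.join(volatile_rundir, sub)
         if not os.path.isdir(subpath):
             continue
         with tarfile.open(subpath + '.tar', 'w') as tf:
             tf.add(subpath, arcname=sub)

     # 3. Make a hard link in "staging" to every evio file and tar file
     staging_rundir = os.path.join(staging_dir, run)
     if not os.path.isdir(staging_rundir):
         os.makedirs(staging_rundir)
     for f in glob.glob(os.path.join(volatile_rundir, '*.evio')) + \
              glob.glob(os.path.join(volatile_rundir, '*.tar')):
         os.link(f, os.path.join(staging_rundir, os.path.basename(f)))

     # 4. Run the jmigrate program (provided by SciComp) to copy "staging" to tape.
     #    The command-line arguments used here are an assumption.
     subprocess.call(['jmigrate', staging_dir])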
hd_link_rundirs.py
This script will make a symbolic link in /gluex/data/rawdata/all for each run directory. It does this for all partitions, so there is a single place to look to find all run directories.
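A minimal sketch of that linking is shown below, under the assumption that run directories sit in each partition's volatile tree; the real hd_link_rundirs.py may scan different locations.

 #!/usr/bin/env python
 # Hypothetical sketch of hd_link_rundirs.py-style linking (not the real script).
 # Assumes run directories sit under each partition's volatile tree.
 import glob, os

 ALLDIR = '/gluex/data/rawdata/all'

 for rundir in glob.glob('/gluonraid3/data*/rawdata/volatile/*/rawdata/Run*'):
     link = os.path.join(ALLDIR, os.path.basename(rundir))
     if not os.path.islink(link):
         os.symlink(rundir, link)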
hd_disk_map_and_free.py
hd_copy_sample.py
- The DAQ system will write files to either the gluonraid1 or gluonraid2 RAID disks in the "/raid/rawdata/active/$RUN_PERIOD/rawdata/RunXXXXXX" directory
- (n.b. only gluonraid1 and gluonraid2 will have a /raid directory. Other computers will mount these as /gluonraid1 and /gluonraid2)
- The current disk being used by the DAQ system is pointed to by the symbolic link "/gluex/raid" which will point to either "/gluonraid1" or "/gluonraid2"
- (the DAQ system may change the disk /gluex/raid points to since the disk is determined by which node the Event Recorder is run on)
- Files will be written to subdirectories containing both the RUN_PERIOD and run number. For example:
- /gluonraid1/rawdata/active/$RUN_PERIOD/rawdata/RunXXXXXX
- where RUN_PERIOD is an environment variable set in the /gluex/etc/hdonline.cshrc script and XXXXXX is the 6-digit, zero-padded run number.
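For illustration, the run directory path can be built from these two pieces as follows (a hypothetical snippet; the run number is just an example):

 # Build the run directory path from RUN_PERIOD and a zero-padded, 6-digit run number
 import os
 run_period = os.environ['RUN_PERIOD']   # e.g. RunPeriod-2015-01, set by /gluex/etc/hdonline.cshrc
 run_number = 3185                       # hypothetical run number
 rundir = '/gluonraid1/rawdata/active/%s/rawdata/Run%06d' % (run_period, run_number)
 # -> /gluonraid1/rawdata/active/RunPeriod-2015-01/rawdata/Run003185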
- The RunXXXXXX directory will contain:
- all raw data files for the run
- a tarball of the DAQ settings
- a monitoring directory containing the online monitoring histograms generated for the run.
- A cron job run by the hdsys account on gluonraid1(2) will run the stage_to_tape script every 10 minutes.
- The stage_to_tape script will (see the sketch after this list):
- move any files whose modification time is more than 10 minutes old from the /gluonraid1/rawdata/active/$RUN_PERIOD directory to the /gluonraid1/rawdata/volatile/$RUN_PERIOD directory, preserving any directory structure.
- make a hard link in the /gluonraid1/rawdata/staging/$RUN_PERIOD directory to each such file in /gluonraid1/rawdata/volatile/$RUN_PERIOD, again preserving directory structure.
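The move-and-link step might look like the following hypothetical sketch; the 10-minute threshold and directory layout come from the description above, everything else is an assumption.

 #!/usr/bin/env python
 # Hypothetical sketch of the move-and-hard-link step described above (not the real script).
 import os, shutil, time

 ACTIVE   = '/gluonraid1/rawdata/active/'   + os.environ['RUN_PERIOD']
 VOLATILE = '/gluonraid1/rawdata/volatile/' + os.environ['RUN_PERIOD']
 STAGING  = '/gluonraid1/rawdata/staging/'  + os.environ['RUN_PERIOD']

 for dirpath, dirnames, filenames in os.walk(ACTIVE):
     rel = os.path.relpath(dirpath, ACTIVE)                 # preserve directory structure
     voldir  = os.path.normpath(os.path.join(VOLATILE, rel))
     stagdir = os.path.normpath(os.path.join(STAGING, rel))
     for fname in filenames:
         src = os.path.join(dirpath, fname)
         if time.time() - os.path.getmtime(src) < 10*60:
             continue                                       # skip files modified in the last 10 minutes
         for d in (voldir, stagdir):
             if not os.path.isdir(d):
                 os.makedirs(d)
         vol = os.path.join(voldir, fname)
         shutil.move(src, vol)                              # move active -> volatile
         os.link(vol, os.path.join(stagdir, fname))         # hard link volatile -> staging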
- A cron job run by the root account on gluonraid1 will run the jmigrate program every 10 minutes. This will:
- copy any files found in the /gluonraid1/rawdata/staging/$RUN_PERIOD directory whose modification time is more than 5 minutes old to the tape library, preserving any directory structure.
- unlink any files in /gluonraid1/rawdata/staging/$RUN_PERIOD for which the copy is complete. This is determined by finding a file already on tape with the correct checksum.
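The completion check can be pictured with the following fragment. This is only an illustration of the logic described above, not the actual jmigrate implementation, and the choice of an md5 checksum is an assumption.

 # Illustration only: unlink a staged file once a copy with a matching checksum exists on tape.
 # This is NOT the actual jmigrate implementation; the md5 choice is an assumption.
 import hashlib, os

 def md5sum(path, blocksize=1024*1024):
     h = hashlib.md5()
     with open(path, 'rb') as f:
         for block in iter(lambda: f.read(blocksize), b''):
             h.update(block)
     return h.hexdigest()

 def unlink_if_on_tape(staged_file, tape_checksum):
     """Remove the staging hard link once the tape copy's checksum matches."""
     if md5sum(staged_file) == tape_checksum:
         os.unlink(staged_file)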
- Stub files referring to the tape-resident files will be placed in the following directory on the JLab CUE:
- /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX
- On the CUE (and therefore in the tape library), the files are owned by the gluex account.
Disk space on RAID
There are two 72TB RAID disks in the counting house named gluonraid1 and gluonraid2. (They actually report to be 77TB, but only about 72-73TB is usable.) These are primarily used for storing the raw data as it is collected. Standard operating procedure is to delete files from one disk to make space just before switching to that disk. This allows files to stay around for as long as possible for use in offline analysis.
Monitoring available space
- The currently selected RAID disk should always be linked from /gluex/raid
- The current RAID disk is monitored via the EPICS alarm system. The remaining space is written to an EPICS variable via the program hd_raidmonitor.py. This is always run by the hdops account on gluonraid1 regardless of whether it is the currently selected disk. This actually updates two EPICS variables:
- HD:coda:daq:availableRAID - The remaining disk space on the RAID in TB
- HD:coda:daq:heartbeat - Updated every 2 seconds to switch between 0 and 1 indicating the hd_raidmonitor.py program is still running
- The system is set to raise a warning if the available space drops below a few TB (ask Hovanes for exact number)
- The system is set to raise an error if the available space drops below ??TB or the heartbeat stops updating
- Note that at this time, the hd_raidmonitor.py is not automatically started by any system. Therefore, if gluonraid1 is rebooted, the program will need to be restarted "by-hand"
- To check if the hd_raidmonitor.py program is running, enter the following:
 > $DAQ_HOME/tools/raidutils/hd_raidmonitor_test/watch_raidmonitor.py
- To start the hd_raidmonitor.py utility, do this:
 > $DAQ_HOME/tools/raidutils/hd_raidmonitor.py >& /dev/null &
- The source for Hall-D-specific tools used for managing the RAID is kept in svn here:
- A second cron job is run on gluonraid1 from the hdsys account to remove files from the gluonraid1/rawdata/volatile directory as needed to ensure disk space is available
- (n.b. at the time of this writing, the second cron job has not been set up!)
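As an illustration of the monitoring described above, a loop like the following could publish the two EPICS variables. This is a hypothetical sketch assuming the pyepics module is available; it is not the actual hd_raidmonitor.py.

 #!/usr/bin/env python
 # Hypothetical sketch of RAID space monitoring (not the actual hd_raidmonitor.py).
 # Assumes the pyepics module is available for channel access writes.
 import os, time
 from epics import caput

 RAID = '/gluex/raid'    # link to the currently selected RAID disk

 heartbeat = 0
 while True:
     st = os.statvfs(RAID)
     free_tb = st.f_bavail * st.f_frsize / 1.0e12    # available space in TB
     caput('HD:coda:daq:availableRAID', free_tb)     # remaining disk space on the RAID
     heartbeat ^= 1                                  # toggle 0/1 so clients can tell we are alive
     caput('HD:coda:daq:heartbeat', heartbeat)
     time.sleep(2)                                   # heartbeat updates every 2 seconds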
Changing Run Periods
We update to a new Run Period at the beginning and at the end of each beam period. This means that data taken in between beam times, which is mainly cosmic data, goes into a separate Run Period. For example, the first commissioning run started in Oct. 2014 and ended Dec. 22nd, 2014; all of that data was part of RunPeriod-2014-10. In January, RunPeriod-2015-01 was created. To begin a new run period one must do the following:
- Make sure no one is running the DAQ, or at least that they are aware that the run period (and output directory) is about to change. They will probably want to log out and back in after you're done to make sure the environment in all windows is updated.
- Submit a CCPR requesting two new tape volume sets be created. An example of a previous request is:
 Hi,

 We would like to have the following two tape volume sets created please:

 /mss/halld/RunPeriod-2015-06/rawdata    raw
 /mss/halld/RunPeriod-2015-06            production

 n.b. some details on how Dave Rackley did this before can be found in CCPR 108991

 Thanks,
- You should specify whether the volume is to be flagged "raw" or "production". The "raw" volumes will be automatically duplicated by the Computer Center. The "production" volumes will not. We generally want all Run Periods to be "raw", even for times when only cosmic data is taken, as that is considered a valuable snapshot of detector performance at a specific point in time.
- To check the current list of halld tape volume sets:
- Go to the SciComp Volume Status page (This is in the "Operations" menu in the "SciComp Group" section on the left side of the SciComp pages.)
- Type "halld" into the search box
- Modify the file /gluex/etc/hdonline.cshrc to set RUN_PERIOD to the new Run Period name.
- Create the Run Period directory on both gluonraid1 and gluonraid2 by logging in as hdops and doing the following:
 > mkdir /raid/rawdata/active/$RUN_PERIOD
 > chmod o+w /raid/rawdata/active/$RUN_PERIOD
- Modify the crontab for root on both gluonraid1 and gluonraid2. Do this by editing the following file (as hdsys) to reflect the new Run Period directory:
 /gluex/etc/crontabs/crontab.root.gluonraid
- when done, the relevant line should look something like this:
 */10 * * * * /root/jmigrate /raid/rawdata/staging/RunPeriod-2015-01 /raid/rawdata/staging /mss/halld
- this will run the jmigrate script every 10 minutes, copying every file found in the /raid/rawdata/staging/RunPeriod-2015-01 directory to the tape library, preserving the directory structure, but replacing the leading /raid/rawdata/staging with /mss/halld. It will also unlink the files and remove any empty directories.
- install it as root on both gluonraid1 and gluonraid2 via:
 # crontab /gluex/etc/crontabs/crontab.root.gluonraid
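To make the path mapping in the crontab line above concrete: the destination on tape is obtained by swapping the leading staging prefix for /mss/halld. A hypothetical snippet illustrating this (the file name is made up):

 # Illustration of the staging-to-tape path mapping described above (not jmigrate itself)
 staged = '/raid/rawdata/staging/RunPeriod-2015-01/rawdata/Run003185/hd_rawdata_003185_000.evio'
 on_tape = staged.replace('/raid/rawdata/staging', '/mss/halld', 1)
 # -> /mss/halld/RunPeriod-2015-01/rawdata/Run003185/hd_rawdata_003185_000.evio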
- Change the run number by rounding up to the next multiple of 10,000.
- Make sure the DAQ system is not running
- Using the hdops account, edit the file $COOL_HOME/hdops/ddb/controlSessions.xml and set the new run number.
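For reference, rounding the run number up to the next multiple of 10,000 can be computed like this (the current run number shown is just an example):

 # Round the current run number up to the next multiple of 10,000 for the new Run Period
 current_run = 21345                              # hypothetical current run number
 new_run = ((current_run // 10000) + 1) * 10000   # -> 30000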