Raid-to-Silo Transfer Strategy
The basics
Getting data to the tape library
Files are copied to the tape library in the Computer Center (bottom floor of F-wing) via a multi-stage process. The details are as follows:
- The DAQ system will write data to one of the 4 partitions on gluonraid3
- The partition being written to is changed every run by the script $DAQ_HOME/scripts/run_prestart_sync
- It does this by running /gluex/builds/devel/packages/raidUtils/hd_rotate_raid_links.py which updates the links:
/gluex/data/rawdata/prev <- Link to partition where previous run was written
/gluex/data/rawdata/curr <- Link to partition where current (most recent) run is (was) written
/gluex/data/rawdata/next <- Link to partition where next run will be written
- Each partition has 3 directory trees used to maintain the data in various stages as it is copied to tape and kept for potential offline analysis within the gluon cluster
/gluonraid3/dataX/rawdata/active   <- directory tree data is written to by the DAQ
/gluonraid3/dataX/rawdata/volatile <- directory tree data is moved to for later analysis on the gluons
/gluonraid3/dataX/rawdata/staging  <- directory tree with files hard-linked to volatile for copying to tape
- A series of cron jobs on gluonraid3 in the hdsys account moves and links the data among these directories.
- These cron jobs are based on 4 scripts:
/gluex/builds/devel/packages/raidUtils/hd_stage_to_tape.py
/gluex/builds/devel/packages/raidUtils/hd_link_rundirs.py
/gluex/builds/devel/packages/raidUtils/hd_disk_map_and_free.py
/gluex/builds/devel/packages/raidUtils/hd_copy_sample.py
- For current details of how these work, refer to the scripts themselves, which have extensive comments at the top describing what they do. They are located in subversion. Here is an overview (a simplified sketch of the staging flow appears after this list):
hd_stage_to_tape.py:
This script will search for completed runs in the "active" directories of all partitions. For any it finds, it will:
- Move the run directory to "volatile"
- Create a tar file of any subdirectories (e.g. RunLog, monitoring, ...)
- Make a hard link to every evio file and tar file in the "staging" directory
- Run the jmigrate program (provided by Scientific Computing Group) to copy files from "staging" to tape
hd_link_rundirs.py
This script will make a symbolic link in /gluex/data/rawdata/all for each run directory on every partition, so there is a single location to look in order to find all runs currently available. It will also remove dead links pointing to run directories that no longer exist.
hd_disk_map_and_free.py
This script will identify partitions that are not currently in use (and not in danger of being used soon) and run the map_disk.py utility on them. It will also run the map_disk_autodelete.py utility to ensure adequate disk space is available for another run.
hd_copy_sample.py
This script will copy the first 3 data files from each run into the "volatile" directory on gluonraid2. This will allow a sample of files for all runs in a RunPeriod to be kept on the gluon cluster.
- Files will be written to subdirectories containing both the RUN_PERIOD and run number. For example:
- /gluonraid3/data1/rawdata/active/$RUN_PERIOD/rawdata/RunXXXXXX
- where RUN_PERIOD is an environment variable set in the /gluex/etc/hdonline.cshrc script and XXXXXX is the 6-digit, zero-padded run number.
- The RunXXXXXX directory will contain:
- all raw data files for the run
- a tar file of the DAQ settings
- a monitoring directory containing the online monitoring histograms generated for the run.
- Stub files referring to the tape-resident files will be placed in the following directory on the JLab CUE:
- /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX
- On the CUE (and therefore in the tape library), the files are owned by the halldata account.
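To make the conventions above concrete, here is a simplified sketch of the staging steps for a single completed run. It is a sketch only: the run number is hypothetical, only one partition is shown, and the real hd_stage_to_tape.py also detects when a run is complete, loops over all partitions, and handles errors.

# Simplified sketch of the staging flow described above, for one completed run.
# The run number and partition are examples; the real hd_stage_to_tape.py
# detects completed runs itself and handles all partitions and error cases.
import glob
import os
import shutil
import subprocess

RUN_PERIOD = os.environ.get("RUN_PERIOD", "RunPeriod-2019-01")
run_number = 61000                         # hypothetical run number
run_dir    = "Run%06d" % run_number        # 6-digit, zero-padded: Run061000

partition = "/gluonraid3/data1/rawdata"
active    = os.path.join(partition, "active",   RUN_PERIOD, "rawdata", run_dir)
volatile  = os.path.join(partition, "volatile", RUN_PERIOD, "rawdata", run_dir)
staging   = os.path.join(partition, "staging",  RUN_PERIOD, "rawdata", run_dir)

# 1. Move the completed run directory from "active" to "volatile"
os.makedirs(os.path.dirname(volatile), exist_ok=True)
shutil.move(active, volatile)

# 2. Tar up any subdirectories (RunLog, monitoring, ...) alongside the data files
for sub in os.listdir(volatile):
    if os.path.isdir(os.path.join(volatile, sub)):
        subprocess.run(["tar", "-cf", os.path.join(volatile, sub + ".tar"),
                        "-C", volatile, sub], check=True)

# 3. Hard-link every evio and tar file into "staging" for the copy to tape
os.makedirs(staging, exist_ok=True)
for f in glob.glob(os.path.join(volatile, "*.evio")) + glob.glob(os.path.join(volatile, "*.tar")):
    os.link(f, os.path.join(staging, os.path.basename(f)))

# 4. jmigrate (from the Scientific Computing group) then copies "staging" to tape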
Disk space on RAID
Standard running will use only the gluonraid3 server. This is broken into 4 partitions of 54TB each. However, based on the advice of Chip, we will routinely use only 80% of these disks in order to optimize the write/read rates by utilizing only the outer portions of the disks. Further, when a partition is not currently in use, it is subject to preparation for the next run, which means an additional 5.76TB of space will be freed (the 5.76TB is based on 800MB/s for a 2 hour run). Thus, 37.6TB of space will be in use on each partition once the disk is initially filled.
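As a quick cross-check of the numbers above, the freed-space and usable-space figures follow directly from the quoted rate, run length, and usage cap (a back-of-the-envelope sketch; the exact usable capacity also depends on filesystem overhead):

# Back-of-the-envelope check of the partition budget quoted above,
# assuming a 54TB partition, an 80% usage cap, and a 2-hour run at 800MB/s.
write_rate_mb_s = 800                 # sustained DAQ write rate in MB/s
run_seconds = 2 * 3600                # length of a 2-hour run in seconds

freed_tb = write_rate_mb_s * run_seconds / 1.0e6    # MB -> TB (decimal units)
print("space freed when preparing a partition: %.2f TB" % freed_tb)   # 5.76 TB

usable_tb = 0.80 * 54.0               # 80% cap on a 54TB partition
print("routinely used space per partition:     %.1f TB" % usable_tb)  # 43.2 TB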
In addition, there are two 72TB RAID disks in the counting house named gluonraid1 and gluonraid2. (They actually report 77TB, but only about 72-73TB is usable.) These are primarily used for storing secondary data streams (PXI data from the magnet, sample data from each run, ...). However, the space will be used for additional storage if the need arises, such as when the connection to the tape library is lost.
Monitoring available space
- The currently selected RAID disk should always be linked from /gluex/raid
- The current RAID disk is monitored via the EPICS alarm system. The remaining space is written to an EPICS variable by the program hd_raidmonitor.py. This is always run by the hdops account on gluonraid1, regardless of whether it is the currently selected disk. It updates two EPICS variables (a minimal sketch of the update loop appears after this list):
- HD:coda:daq:availableRAID - The remaining disk space on the RAID in TB
- HD:coda:daq:heartbeat - Toggled between 0 and 1 every 2 seconds to indicate that the hd_raidmonitor.py program is still running
- The system is set to raise a warning if the available space drops below a few TB (ask Hovanes for exact number)
- The system is set to raise an error if the available space drops below ??TB or the heartbeat stops updating
- Note that at this time, hd_raidmonitor.py is not automatically started by any system. Therefore, if gluonraid1 is rebooted, the program will need to be restarted by hand
- To check if the hd_raidmonitor.py program is running, enter the following:
> $DAQ_HOME/tools/raidutils/hd_raidmonitor_test/watch_raidmonitor.py
- To start the hd_raidmonitor.py utility do this
> $DAQ_HOME/tools/raidutils/hd_raidmonitor.py >& /dev/null &
- The source for the Hall-D specific tools used for managing the RAID is kept in svn here:
- A second cron job is run on gluonraid1 from the hdsys account to remove files from the gluonraid1/rawdata/volatile directory as needed to ensure disk space is available
- (n.b. at the time of this writing, the second cron job has not been set up!)
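Below is a minimal sketch of the kind of update loop described above, assuming the two PVs listed and the pyepics caput interface; the real hd_raidmonitor.py may use a different EPICS binding and adds proper error handling.

# Illustrative only: a minimal monitor loop in the spirit of hd_raidmonitor.py.
# Assumes the pyepics package and the PV names listed above; the real script
# may use a different EPICS interface.
import os
import time
from epics import caput   # pyepics

RAID_PATH = "/gluex/raid"            # link to the currently selected RAID disk
heartbeat = 0

while True:
    # Free space on the RAID partition, converted to TB (decimal units)
    stat = os.statvfs(RAID_PATH)
    free_tb = stat.f_bavail * stat.f_frsize / 1.0e12
    caput("HD:coda:daq:availableRAID", free_tb)

    # Toggle the heartbeat so the alarm system can tell the monitor is alive
    heartbeat = 1 - heartbeat
    caput("HD:coda:daq:heartbeat", heartbeat)

    time.sleep(2)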
Changing Run Periods
We will update to a new Run Period at the beginning of each beam period. This means that in between beam times, when we take mainly cosmic data, the data will go into the previous Run Period's directory. A month or two before a new Run Period, the new volume sets should be set up so that cosmic and calibration data taken just prior to the run will go to the right place. To begin a new run period one must do the following:
- Make sure no one is running the DAQ, or at least that everyone is aware that the run period (and output directory) is about to change. They will probably want to log out and back in after you're done to make sure the environment in all of their windows is updated.
- Submit a CCPR requesting two new tape volume sets be created. The "category" should be set to "Scientific Computing" and the subject something like "New tape volume sets for Hall-D". An example of a previous request is:
- Hi,
We would like to have the following two tape volume sets created please:
/mss/halld/RunPeriod-2019-01/rawdata raw
/mss/halld/RunPeriod-2019-01 production
n.b. some details on how Dave Rackley did this before can be found in CCPR 121062
Thanks,
- You should specify whether the volume set is to be flagged "raw" or "production". The "raw" volumes will be automatically duplicated by the Computer Center. The "production" volumes will not. We generally want all Run Periods to be "raw", even times when only cosmic data is taken, as that is considered a valuable snapshot of detector performance at a specific point in time.
- To check the current list of halld tape volume sets:
- Go to the SciComp Home page
- On the left side menu, in the "Tape Library" section, click on Usage->"Accumulation"
- In the main content part of the window, you should now be able to click the arrow next to "halld" to open a submenu
- In the submenu, click the arrow next to "raw" to see the existing volume sets
- (n.b. The sets listed will be limited to those written to during the dates shown in the calendar in the upper right corner.)
- Modify the file /gluex/etc/hdonline.cshrc to set RUN_PERIOD to the new Run Period name.
- Create the Run Period directory on all raid disks (gluonraid1, gluonraid2, gluonraid3, and gluonraid4) by logging in as hdops and doing the following:
- for gluonraid1 and gluonraid2:
> mkdir /raid/rawdata/active/$RUN_PERIOD
> chmod o+w /raid/rawdata/active/$RUN_PERIOD
- for gluonraid3 and gluonraid4:
> mkdir /data1/rawdata/active/$RUN_PERIOD
> mkdir /data2/rawdata/active/$RUN_PERIOD
> mkdir /data3/rawdata/active/$RUN_PERIOD
> mkdir /data4/rawdata/active/$RUN_PERIOD
> chmod o+w /data?/rawdata/active/$RUN_PERIOD
- Modify the crontab for root on both gluonraid1 and gluonraid2. Do this by editing the following files (as hdsys) to reflect the new Run Period directory:
/gluex/etc/crontabs/crontab.root.gluonraid1
/gluex/etc/crontabs/crontab.root.gluonraid2
- when done, the relevant lines should look something like this:
*/10 * * * * /root/jmigrate /raid/rawdata/staging/RunPeriod-2019-01 /raid/rawdata/staging /mss/halld
- this will run the jmigrate script every 10 minutes, copying every file found in the /raid/rawdata/staging/RunPeriod-2019-01 directory to the tape library, preserving the directory structure but replacing the leading /raid/rawdata/staging with /mss/halld. It will also unlink the files and remove any empty directories. (A sketch of this path-rewriting logic appears at the end of this section.)
- install it as root on both gluonraid1 and gluonraid2 via:
# crontab /gluex/etc/crontabs/crontab.root.gluonraid
- n.b. you do NOT need to update the root crontab on gluonraid3 or gluonraid4. Those run jmigrate from the hdsys account cronjob via the hd_stage_to_tape.py script.
- Change the run number by rounding up to the next multiple of 10,000.
- Make sure the DAQ system is not running
- Using the hdops account, edit the file $COOL_HOME/hdops/ddb/controlSessions.xml and set the new run number.
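For reference, here is a minimal sketch of the path-rewriting behaviour described in the crontab step above. It is illustrative only: the real transfer is done by the jmigrate tool provided by the Scientific Computing group, which hands files to the tape library rather than doing the plain filesystem copy shown here, and the /mss/halld destination is the tape-library namespace, not an ordinary directory.

# Illustrative sketch of what the jmigrate crontab entry above accomplishes;
# the real jmigrate tool submits files to the tape library instead of the
# plain copy shown here.
import os
import shutil

SRC_ROOT  = "/raid/rawdata/staging/RunPeriod-2019-01"   # directory scanned every 10 minutes
STRIP     = "/raid/rawdata/staging"                     # leading path to be replaced
DEST_ROOT = "/mss/halld"                                # tape-library namespace

for dirpath, dirnames, filenames in os.walk(SRC_ROOT, topdown=False):
    for name in filenames:
        src = os.path.join(dirpath, name)
        # Preserve the directory structure, swapping only the leading prefix
        dest = src.replace(STRIP, DEST_ROOT, 1)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(src, dest)    # jmigrate hands this off to the tape system
        os.unlink(src)             # remove the hard link left in "staging"
    # Walking bottom-up lets us drop directories that are now empty,
    # leaving the top-level staging directory itself in place
    if dirpath != SRC_ROOT and not os.listdir(dirpath):
        os.rmdir(dirpath)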