Offline Monitoring Incoming Data

Latest revision as of 16:19, 29 October 2019

Saving Online Monitoring Data

The procedure for writing the data out is given in, e.g., Raid-to-Silo Transfer Strategy.

Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape, and within ~20 min., we will have access to the file on tape at /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.

All online monitoring plugins will be run as data is taken. They will be accessible within the counting house via RootSpy, and for each run and file, a ROOT file containing the histograms will be saved within a subdirectory for each run.

For immediate access to these files, the raid disk files may be accessed directly from the counting house, or the tape files will be available within ~20 min. of the file being written out.

Preparing the software

Do the exact same steps as detailed for the offline monitoring and reconstruction setup (Offline_Monitoring_Archived_Data), EXCEPT for the following:

1) Replace "monitoring_launch" with "monitoring_incoming".

2) The software should be built under a different directory name (e.g. "build1") instead of "monitoring_incoming", and a soft link should then be created:

ln -s build1 monitoring_incoming

This way, if the software needs to be updated in the middle of the run, you just create a new build in parallel (e.g. "build2") and then switch the symbolic links when you're ready.
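The switch described above can be sketched as follows. This is a minimal, side-effect-free illustration (it runs in a temporary directory; the directory names are the illustrative ones from the text, not real build paths):

```shell
# Sketch of the parallel-build symlink switch; names are illustrative.
cd "$(mktemp -d)"                    # sandbox so this sketch touches nothing real
mkdir build1 build2                  # two independent builds, side by side
ln -s build1 monitoring_incoming     # environment initially points at build1
ln -sfn build2 monitoring_incoming   # -n retargets the link itself rather than descending into it
readlink monitoring_incoming         # now prints: build2
```

The `-n` (`--no-dereference`) flag matters: without it, `ln -sf` would create the new link *inside* the old target directory instead of repointing the link.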

3) Don't create a CCDB sqlite file. One is created individually for each job, so that each job has the most up-to-date calibration constants.

Starting A New Run Period

  • Do the exact same steps as detailed in "Starting a new run period" on the Offline_Monitoring_Archived_Data page.

Launching for a new run period

1) Download the "monitoring" scripts directory from svn. For the gxprojN accounts, use the directory ~/monitoring/:

cd ~/
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/
cd monitoring/incoming

2) Update the jobs_incoming.config job config file. Definitely be sure to update RUNPERIOD. Monitoring of the incoming data should always be ver01.

vi ~/monitoring/incoming/jobs_incoming.config
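For reference, an earlier revision of this page showed a typical job config of the following form (for the then-current input.config; treat the exact field names and values as illustrative and defer to the actual jobs_incoming.config):

```
PROJECT        gluex
TRACK          reconstruction
NCORES         24        # 24 = entire node
DISK           40
RAM            32        # GB per node
TIMELIMIT      8
JOBNAMEBASE    offmon
RUNPERIOD      2016-02
VERSION        01
OUTPUT_TOPDIR  /cache/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION]   # other variables may be interpolated
SCRIPTFILE     /home/gxproj1/monitoring/incoming/script.sh                          # full path required
ENVFILE        /home/gxproj1/env_monitoring_incoming                               # full path required
```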

3) Update the jana_incoming.config jana config file. This contains the command line arguments given to JANA. Definitely be sure to update REST:DATAVERSIONSTRING.

vi ~/monitoring/incoming/jana_incoming.config

4) Create the SWIF workflow. The workflow should have a name like "offmon_2016-10_ver01". It should also match the workflow name in the job config file (e.g. jobs_incoming.config).

swif create -workflow <my_workflow>

5) In ~/monitoring/incoming/cron_exec.csh, modify the script to run for the new run period, e.g. for 2016-02:

~/monitoring/incoming/cron_exec.csh

6) Before launching the cron job, run the script manually first, in case there are already a lot of files on disk and the first execution takes longer than 15 minutes; under cron, jobs could then be double-submitted! So, first execute the python script manually (this submits jobs for the first 5 files (000 -> 004) of every run that is on /mss/ but hasn't been submitted yet):

python ~/monitoring/incoming/submit_jobs.py 2016-10 ~/monitoring/incoming/jobs_incoming.config 5 >& ~/incoming_log.txt

7) Update the script for post-processing for the new run period:

~/monitoring/process/check_monitoring_data.csh

8) Add the incoming data to the data version database:

~/monitoring/process/register_new_version.py add ~/monitoring/process/version/incoming_2016-10_ver01

9) Check if the cron daemon is running on that node:

ps aux | grep crond
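Note that a plain grep will usually match its own process in the ps listing. The standard bracket trick (generic grep behavior, nothing specific to this setup) avoids that false match:

```shell
# [c]rond still matches the literal string "crond" in the ps output,
# but not the grep command line itself, which contains "[c]rond".
ps aux | grep '[c]rond'
```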

10) Now that the initial batch of jobs has been submitted, launch the cron job by running:

crontab cron_incoming
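The schedule itself lives in the cron_incoming file. A minimal sketch of what such an entry might look like, assuming a 15-minute cadence (consistent with the warning in step 6) and the log location described in step 12 (both are assumptions; defer to the actual file):

```
# Hypothetical cron_incoming entry; cadence and paths are assumptions.
*/15 * * * * /home/gxproj1/monitoring/incoming/cron_exec.csh >> /home/gxproj1/incoming.log 2>&1
```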

11) To check whether the cron job is running (on the same machine you launched the cron job, i.e. for CentOS7: ifarm1401 or ifarm1402), do

crontab -l

12) The stdout & stderr from the cron job are piped to log files located at:

~/incoming.log

and

~/check.log

13) Periodically check how the jobs are doing, and modify and resubmit failed jobs as needed (where <problem> can be one of SYSTEM, TIMEOUT, RLIMIT):

swif status <workflow>
~/monitoring/hdswif/hdswif.py resubmit <workflow> <problem>

14) To remove the cron job (e.g. at the end of the run), do:

crontab -r