Data Monitoring Procedures

== Master List of File / Database / Webpage Locations ==

=== Run Conditions ===
* Online run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
* Offline monitoring run conditions (software versions, JANA config): /group/halld/data_monitoring/run_conditions/
* Run Info vers. 1
* Run Info vers. 2
* RCDB

=== Monitoring Output Files ===
In the paths below, 201Y-MM stands for a run period (for example 2015-03) and verVV for a launch version (for example ver15).
* Online monitoring histograms: /work/halld/online_monitoring/root/
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
* Individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/

=== Monitoring Database ===
* Accessing the monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring
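
As a usage sketch only (no assumptions are made about the table layout; the commands below simply enumerate what is there), the database can be inspected read-only from an ifarm node:

<pre>
#!/bin/bash
# Inspect the data_monitoring database from an ifarm node, using the
# account and host listed above.  "SHOW TABLES" and "DESCRIBE" only
# enumerate the schema, so no table names are assumed here.
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "SHOW TABLES;"

# After picking a table from that list, look at its columns before writing
# a real query (MY_TABLE is a placeholder, not an actual table name):
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "DESCRIBE MY_TABLE;"
</pre>
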
=== Monitoring Webpages ===
 
* [https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
* [https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
* [https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]
 
== SciComp Job Links ==

=== Main ===
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]

=== Documentation ===
* [https://scicomp.jlab.org/docs/batch Batch System]
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
* [https://scicomp.jlab.org/docs/swif SWIF]
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]

=== Job Tracking ===
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]
  
== Procedures: Overview ==

=== Online Monitoring: During Experimental Running ===

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cron job running in the counting house performs this function.

This ROOT file is processed similarly to the offline monitoring results and is made available under the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift this page].
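
As a quick orientation, the sketch below shows where these per-run files end up. The directory layout is taken from the master list at the top of this page; the grep pattern on the run number is only an assumption about the file naming and should be checked against the actual directory contents.

<pre>
#!/bin/bash
# Locate the online-monitoring output copied from the counting house for one
# run.  RUN is a hypothetical run number; adjust the pattern to the actual
# file names found in these directories.
RUN=030300

# Histogram ROOT files written after each run:
ls -lh /work/halld/online_monitoring/root/ | grep ${RUN}

# Run-condition files for the same run:
ls -lh /work/halld/online_monitoring/conditions/ | grep ${RUN}
</pre>
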
=== Offline Monitoring and Reconstruction: During Experimental Running ===
During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
  
# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly recorded run as soon as it hits the tape.
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape faster than the monitoring can keep up with. Also, during the experimental run, each run will only be fully reconstructed once, since it will be difficult enough to keep up with the incoming data. A sketch of how these first files can be selected from the tape area is shown below.
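
The sketch below illustrates the file selection only. It assumes the usual /mss/halld/&lt;run period&gt;/rawdata/RunXXXXXX tape layout and a .evio extension for the raw-data files, and it leaves the actual job submission to the launch scripts documented under "Procedures: Details" below.

<pre>
#!/bin/bash
# Enumerate the first 5 raw-data files of each run of a run period, as the
# input list for a monitoring launch.  RUNPERIOD is hypothetical; the .evio
# extension and the Run* directory naming are assumptions to be checked
# against the actual /mss contents.
RUNPERIOD=RunPeriod-2019-11
MAXFILES=5

for rundir in /mss/halld/${RUNPERIOD}/rawdata/Run*; do
    ls "${rundir}"/*.evio 2>/dev/null | sort | head -n ${MAXFILES} |
    while read -r f; do
        # Placeholder for the real submission step (SWIF / launch scripts):
        echo "would submit monitoring job for ${f}"
    done
done
</pre>
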
  
=== Offline Monitoring and Reconstruction: After Experimental Running ===
After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
# '''Monitoring Launches:''' Every two weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.
=== Saving to Tape (Write-through Cache): Monitoring Launches ===
All job output will be written directly to the write-through cache. However, only the following will be saved to tape:
* REST files: all files.
* ROOT files: one merged file per run.
** After the merge, the individual files are deleted (so they won't be saved).
* Job stdout/stderr: one tarball per run.
** After launch analysis, the log files are deleted (so they won't be saved).
* Browser PNGs: one tarball per launch.
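
As an illustration of this policy only (a sketch, not the actual launch post-processing scripts), the per-run merge and log tarball could look like the following; the subdirectory names under the launch output area are placeholders.

<pre>
#!/bin/bash
# Produce the per-run products that actually go to tape for a monitoring
# launch: one merged ROOT file and one log tarball per run.  The base path
# follows the master list at the top of this page; the "hists" and "log"
# subdirectory names and file patterns are placeholders.
RUNPERIOD=RunPeriod-2019-11   # hypothetical
VER=ver01                     # hypothetical
RUN=030300                    # hypothetical
BASE=/volatile/halld/offline_monitoring/${RUNPERIOD}/${VER}

# One merged ROOT file per run (hadd ships with ROOT), then drop the inputs
# so that only the merged file is saved.
hadd -f ${BASE}/rootfiles/hd_monitoring_${RUN}_merged.root ${BASE}/hists/${RUN}/*.root
rm ${BASE}/hists/${RUN}/*.root

# One stdout/stderr tarball per run, then drop the individual log files.
tar -czf ${BASE}/log_${RUN}.tar.gz -C ${BASE}/log ${RUN}
rm -r ${BASE}/log/${RUN}
</pre>
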
  
=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===
* REST files: all files.
* ROOT files: all files, <span style="color:blue">AND</span> one merged file per run.
* Job stdout/stderr: one tarball per run.
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
* Browser PNGs: one tarball per launch.
== Procedures: Details ==
* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
** [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]
  
== Software Tests ==

* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
** [https://halldweb.jlab.org/recon_test/ Test Results]
