Data Monitoring Procedures

__TOC__

== Master List of File / Database / Webpage Locations ==

=== Run Conditions ===
* Online run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/
* [http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]
* [https://halldweb.jlab.org/rcdb RCDB]

=== Monitoring Output Files ===
* In the paths below, the run period 201Y-MM is, for example, 2015-03, and the launch version verVV is, for example, ver15.
* Online monitoring histograms: /work/halld/online_monitoring/root/
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
* Individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
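
For example, with those placeholders filled in, the merged offline monitoring ROOT files for launch ver15 of the 2015-03 run period would be found under:

 /work/halld/data_monitoring/RunPeriod-2015-03/ver15/rootfiles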
  
=== Monitoring Database ===
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring
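
A minimal sketch of a quick interactive check (the available table names are not listed on this page, so SHOW TABLES is used to discover them):

 # connect to the monitoring database from an ifarm node and list its tables
 mysql -u datmon -h hallddb.jlab.org data_monitoring -e "SHOW TABLES;"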
  
=== Monitoring Webpages ===
* [https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
* [https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
* [https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]

== SciComp Job Links ==

=== Main ===
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]

=== Documentation ===
* [https://scicomp.jlab.org/docs/batch Batch System]
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
* [https://scicomp.jlab.org/docs/swif SWIF]
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]

=== Job Tracking ===
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]

== Procedures: Overview ==

=== Online Monitoring: During Experimental Running ===

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cron job running in the counting house performs this function.

This ROOT file is processed similarly to the offline monitoring results, and the histograms are made available on the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift this page].
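
As a rough sketch of where to find a given run's online-monitoring output (the exact file-naming convention is not specified here, so the run number is simply grepped for; 003180 is a placeholder):

 # histogram and run-condition files copied by the counting-house cron job
 ls /work/halld/online_monitoring/root/ | grep 003180
 ls /work/halld/online_monitoring/conditions/ | grep 003180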

=== Offline Monitoring and Reconstruction: During Experimental Running ===

During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly-recorded run as soon as it hits the tape (see the sketch below).
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with. Also, during the experimental run, each run will only be fully reconstructed once, because it will be difficult enough to keep up with the incoming data.
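
A minimal sketch of the "incoming" check, assuming the usual /mss/halld/RunPeriod-201Y-MM/rawdata layout of the tape-library stub files (the run period and run number are placeholders):

 # list the stub files for one newly-recorded run and keep the first 5
 ls /mss/halld/RunPeriod-2015-03/rawdata/Run003180/ | sort | head -5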

=== Offline Monitoring and Reconstruction: After Experimental Running ===

After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

# '''Monitoring Launches:''' Every two weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.

=== Saving to Tape (Write-through Cache): Monitoring Launches ===

All job output will be written directly to the write-through cache. However, only the following will be saved to tape:
* REST files: All files.
* ROOT files: One merged file per run (see the merging sketch below).
** After the merge, the individual files are deleted (so they won't be saved).
* Job stdout/stderr: One tarball per run.
** After launch analysis, the log files are deleted (so they won't be saved).
* Browser PNGs: One tarball per launch.
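
The per-run merge and log-tarball steps above amount to something like the following sketch (the directory layout and file names here are hypothetical; hadd is the standard ROOT histogram-merging utility):

 # hypothetical per-run output directory from a monitoring launch
 RUNDIR=/volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/Run003180
 # merge the per-job ROOT files into one file per run, then bundle and remove the logs
 hadd -f ${RUNDIR}/hd_root_merged.root ${RUNDIR}/hd_root_*.root
 tar -czf ${RUNDIR}/log_003180.tar.gz ${RUNDIR}/log/ && rm -r ${RUNDIR}/log/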

=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===

As above, only the following will be saved to tape:
* REST files: All files.
* ROOT files: All files, <span style="color:blue">AND</span> one merged file per run.
* Job stdout/stderr: One tarball per run.
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
* Browser PNGs: One tarball per launch.

== Procedures: Details ==

* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]

=== On- and Offline Monitoring Data Validation ===
* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
* [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]

== Software Tests ==
* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
** [https://halldweb.jlab.org/recon_test/ Test Results]