Difference between revisions of "Data Monitoring Procedures"

From GlueXWiki
(Running Over Archived Data)
(On- and Offline Monitoring Data Validation)
(234 intermediate revisions by 8 users not shown)
Line 1: Line 1:
 
__TOC__
 
__TOC__
  
==Saving Online Monitoring Data==
+
== Master List of File / Database / Webpage Locations ==
 +
=== Run Conditions ===
 +
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
 +
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/
 +
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]
 +
*[https://halldweb.jlab.org/rcdb RCDB]
  
The procedure for writing the data out is given in, e.g.,
+
=== Monitoring Output Files ===
[https://halldweb1.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].
+
* Placeholders used below: the run period 201Y-MM is, for example, 2015-03; the launch version verVV is, for example, ver15
 +
* Online monitoring histograms: /work/halld/online_monitoring/root/
 +
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
 +
* individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
  
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,
+
=== Monitoring Database ===
and within ~20 min., we will have access to the file on tape at
+
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.
+
  
All online monitoring plugins will be run as data is taken.
+
=== Monitoring Webpages ===
They will be accessible within the counting house via RootSpy, and
+
*[https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
for each run and file, a ROOT file containing the histograms will be saved
+
*[https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
within a subdirectory for each run.
+
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
 +
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]
  
For immediate access to these files, the raid disk files may be accessed directly
+
== SciComp Job Links ==
from the counting house, or the tape files will be available within ~20 min. of the
+
=== Main ===
file being written out.
+
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
 +
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
 +
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]
  
==Running Over Data As It Comes In==
+
=== Documentation ===
 +
* [https://scicomp.jlab.org/docs/batch Batch System]
 +
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
 +
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
 +
* [https://scicomp.jlab.org/docs/swif SWIF]
 +
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]
  
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.
+
=== Job Tracking ===
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs
+
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]
run the previous Friday. The procedure for this is shown below.
+
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]
 +
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]
 +
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]
 +
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]
  
=== Setting up the environment ===
+
== Procedures: Overview ==
The file
+
/home/gxproj1/setup_jlab.csh
+
is sourced through .tcshrc.
+
This file is the same as what is linked to by
+
/home/gluex/setup_jlab_commissioning.csh,
+
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this
+
user can have a separate build.
+
  
To obtain the builds from the previous Friday's runs,
+
=== Online Monitoring: During Experimental Running ===
execute
+
/home/gxproj1/halld/monitoring/newruns/setup.sh [year] [month] [day]
+
The build revisions from the previous Friday are archived in files
+
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml
+
and the script will build libraries based on those stored revision numbers.
+
  
=== Running the cron job ===
+
After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cron job running in the counting house performs this function.
  
To run the cron job go to
+
This ROOT file is processed similarly to the offline monitoring results, and is made available under the same webpages as "ver00" of the relevant run period.
/u/home/gxproj1/halld/monitoring/newruns
+
and do
+
crontab cron_plugins
+
To check whether the cron job is running, do
+
crontab -l
+
To remove the cron job do
+
crontab -r
+
  
The cron job will run the script scan_for_jobs.sh,
+
For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift  this page].
which runs generatejobs_plugins_rawdata.sh for any
+
new runs that it had not seen before. All previous
+
runs are recorded in the file filelists/files_current.txt
+
so clear this file to run over those runs again, or set the parameters
+
MINRUN and MAXRUN which will set the range of runs submitted.
+
  
==Running Over Archived Data==
+
=== Offline Monitoring and Reconstruction: During Experimental Running ===
  
Once the files are written to tape, we can run the online plugins on these files to confirm what we were seeing in the online monitoring.
+
During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
Manual scripts and cron jobs are set up to look for new run numbers and run the plugin over a sample of files.
+
  
=== Details of Offline Monitoring ===
+
# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly-recorded run as soon as it hits the tape.
 +
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
 +
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
  
Below are the procedures to
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with.  Also, during the experimental run, each run will only be fully-reconstructed once, because it will be difficult enough to keep up with the incoming data.
* run a single offline plugin job manually
+
* launch weekly runs
+
* run a cron job to automate the process for new files
+
  
In principle these scripts should work, but if there are changes in
+
=== Offline Monitoring and Reconstruction: After Experimental Running ===
the directory structure for the rawdata files, or if there is a significant
+
increase in the memory or disk space necessary for the jobs, these should
+
be modified.
+
  
==== Generating an offline plugin job ====
+
After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
  
The user gxproj1 should be used for official offline monitoring jobs.
+
# '''Monitoring Launches:''' Every two weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.  
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.
+
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
The main script is generatejobs_plugins_rawdata.sh, which can be used as
+
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)
+
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data. 
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify
+
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
the file range for that run.
+
  
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.
Executing this script will send the monitoring plugins job to the Auger batch system.
+
  
There is also a script clean.sh which can be used as
+
=== Saving to Tape (Write-through Cache): Monitoring Launches ===
clean.sh XXX
+
All job output will be directly written to the write-through cache. However, only the following will be saved to tape:
This will remove all files created for run XXX.
+
* REST files: All files.  
 +
* ROOT files: One merged file per run.  
 +
** After merge, the individual files are deleted (so they won't be saved).
 +
* Job stdout/stderr: One tarball per run
 +
** After launch analysis, the log files are deleted (so they won't be saved).
 +
* Browser png's: One tarball per launch
  
Internally, the xml file used to submit the job will be created, and the job to run
+
=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===
will be given within script.sh. All run parameters should be specified at the beginning
+
* REST files: All files.  
of generatejobs_plugins_rawdata.sh
+
* ROOT files: All files, <span style="color:blue">AND</span> one merged file per run.  
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run
+
* Job stdout/stderr: One tarball per run
over this cached file.
+
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
 +
* Browser png's: One tarball per launch
  
==== Launch weekly runs ====
 
  
Every Friday, jobs will be started to run the newest software on all previous runs.
+
== Procedures: Details ==
This is done using the gxproj1 account. Log in as this user, and
+
build HDDS, sim-recon, online plugins, and sqlite file.
+
  
 +
* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
 +
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
 +
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
 +
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
 +
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
 +
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]
  
 +
=== On- and Offline Monitoring Data Validation===
 +
* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
 +
* [[Offline_Monitoring_Data_Validation_PrimEx | Offline Monitoring: Data Validation of PrimEx data]]
 +
* [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]
  
==== Using cron to run automatically ====
+
== Software Tests ==
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins
+
* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
that can be executed via
+
** [https://halldweb.jlab.org/recon_test/ Test Results]
crontab cron_plugins
+
This will set up a cron job to call the script scan_for_jobs.sh, which will
+
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for
+
any run that is more than 5 min old. The cron job is set up to run every 10 min.
+
 
+
==== Magnetic Field Settings ====
+
For tracking, it is necessary to set the correct magnetic field settings.
+
The below table is now obsolete. Please refer to the [https://halldweb1.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.
+
 
+
The actual field settings for each run have not been documented well.
+
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].
+
 
+
{| border="1" cellpadding="1" style="text-align: center;"
+
!width="100"| Run #s
+
!width="150"| Solenoid Current (A)
+
!width="300"| JANA option
+
!width="300"| notes
+
|-
+
| 940 - 996  || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104  || At run 997, the solenoid started ramping down
+
|-
+
| 998 - 1448 ||    0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.
+
|-
+
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520  ||
+
|-
+
|}
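
The table above can be read as a lookup from run number to JANA options. The following is only a sketch that transcribes the three tabulated ranges; the function name and the table layout are hypothetical, it does not cover the 300A runs, and the Run Conditions webpage remains the authoritative source.

```python
# Run ranges and JANA options copied from the (obsolete) table above.
# Each entry is (first run, last run, list of JANA options).
FIELD_TABLE = [
    (940, 996, ["-PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104"]),
    (998, 1448, ["-PBFIELD_TYPE=NoField", "-PDEFTAG:DTrackCandidate=StraightLine"]),
    (1449, 1620, ["-PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520"]),
]


def jana_field_options(run):
    """Return the tabulated JANA field options for a run number.

    Hypothetical helper: raises KeyError for runs outside the table
    (for example run 997, when the solenoid was ramping down).
    """
    for first, last, options in FIELD_TABLE:
        if first <= run <= last:
            return list(options)
    raise KeyError(f"no field setting tabulated for run {run}")
```

Raising an error for untabulated runs, rather than falling back to a default field, avoids silently tracking with the wrong map.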
+
 
+
'''Note:''' The following run ranges were taken with a 300A field and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318. Runs 1307, 1308, and 1319 - 1329 were taken while the magnet was ramping. More study is needed to determine whether the straight-line track fitter is required for these runs.
+
 
+
=== Procedures of Running Offline Monitoring ===
+
This section is mostly just for documentation and is intended
+
for the person who will run the jobs periodically (currently Kei).
+
Currently we are thinking of running the jobs every Friday at the end of the day.
+
This allows everybody to see improvements in each detector over the week.
+
 
+
# <s>Acquire build lock for user gluex. This means sending out an email to Paul, Sean, Mark I, Kei to not make change in svn directories for user gluex, or to try to build anything until the jobs are finished.</s>
+
# We will run the jobs from user gxproj1 as of December 12 2014 on an independent build. This is the account that will process runs as they come in. To stop this, first remove the cron job with crontab -r.
+
# svn update & rebuild HDDS
+
# svn update & rebuild sim-recon
+
# svn update & rebuild monitoring hists
+
# Prepare the latest sqlite file & update the JANA_CALIB_URL environment variable with
+
cp /group/halld/www/halldweb1/html/dist/ccdb.sqlite /group/halld/www/halldweb1/html/dist/ccdb_2014-MM-DD.sqlite
+
# Submit jobs. Currently we need to set the magnetic field settings by hand within the script generatejobs_plugins_rawdata.sh.
+
# <s>When jobs finish, unlock. This means sending another email to Paul, Sean, Mark I, Kei.</s>
+
# Restart cron jobs for immediate processing of runs coming in. Make sure to update magnetic field settings here also.
+
# Contact Sean to notify that there are new runs available in /volatile for copying over to the /work disk.
+
 
+
==Extracting Summary Data==
+
 
+
For high-level monitoring, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page.  This section describes how to generate the monitoring images and database information.
+
 
+
The scripts used to generate this summary data are currently kept in /u/home/gluex/halld/monitoring/process
+
Note that these scripts currently have some parameters which must be periodically set by hand.
+
 
+
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database.  To run these scripts, load the environment with the following command:
+
<syntaxhighlight>
+
source /u/home/gluex/halld/monitoring/process/monitoring_env.sh
+
</syntaxhighlight>
+
 
+
===Online Monitoring===
+
 
+
There are two scripts for running over the monitoring data generated by the online system and offline reconstruction.  The online script is run with either of the following commands:
+
<syntaxhighlight>
+
./check_new_runs.py
+
 
+
OR
+
 
+
./check_new_runs.csh
+
</syntaxhighlight>
+
The shell script sets up the environment properly to run the python script.  To connect to the monitoring database on the JLab CUE, modules contained in the installation of python >= 2.7 are needed.  The shell script is appropriate to use in a cron job.
+
 
+
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house.  This python script automatically checks for new ROOT files and processes them.  It contains several configuration variables that must be set correctly, such as the locations of the input and output directories.
+
 
+
===Offline Monitoring===
+
 
+
The processing of offline monitoring data should be run after a new reconstruction pass is done.  The data is processed using the following script:
+
 
+
<syntaxhighlight>
+
./process_new_offline_data.py <input directory> <output directory>
+
 
+
EXAMPLE:
+
 
+
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02
+
</syntaxhighlight>
+
 
+
Every time a new reconstruction pass is performed, a new version number must be generated.  To do this, prepare a version file as described below.  Then run the register_new_version.py script to store the information in the database.  The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure.  An example of how to generate a new version is:
+
<syntaxhighlight>
+
./register_new_version.py add /u/home/gluex/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt
+
</syntaxhighlight>
+
 
+
==Data Versions==
+
 
+
To document the conditions under which the monitoring data is created, for the sake of reproducibility and further analysis, we save several pieces of information.  The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.
+
 
+
We store one record per pass through one run period, with the following structure:
+
 
+
{| class="wikitable"
+
! Field !! Description
+
|-
+
| data_type || The level of data we are processing.  For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
+
|-
+
| run_period || The run period of the data
+
|-
+
| revision || An integer specifying which pass through the run period this data corresponds to
+
|-
+
| software_version || The name of the XML file that specifies the different software versions used
+
|-
+
| jana_config  || The name of the text file that specifies which JANA options were passed to the reconstruction program
+
|-
+
| ccdb_context  || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
+
|-
+
| production_time  || The date at which monitoring/reconstruction began
+
|-
+
| dataVersionString  || A convenient string for identifying this version of the data
+
|}
+
 
+
 
+
An example file used as input to ./register_new_version.py is:
+
<syntaxhighlight>
+
data_type          = recon
+
run_period          = RunPeriod-2014-10
+
revision            = 1
+
software_version    = soft_comm_2014_11_06.xml
+
jana_config        = jana_rawdata_comm_2014_11_06.conf
+
ccdb_context        = calibtime=2014-11-10
+
production_time    = 2014-11-10
+
dataVersionString  = recon_RunPeriod-2014-10_20141110_ver01
+
</syntaxhighlight>
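
As an illustration of the file format above, a version file can be read as "key = value" lines. This is a minimal sketch and not necessarily how register_new_version.py reads the file; the function name is hypothetical. The split is done on the first "=" only, because values such as ccdb_context can themselves contain "=".

```python
def parse_version_file(text):
    """Parse "key = value" lines into a dict.

    Hypothetical helper: splits on the first "=" only, so a value like
    "calibtime=2014-11-10" is kept intact. Blank lines and lines
    without "=" are skipped.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields
```

For the example file above, the "ccdb_context" entry of the returned dict would hold the full string calibtime=2014-11-10.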
+

Revision as of 19:27, 28 October 2021

Master List of File / Database / Webpage Locations

Run Conditions

  • Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
  • Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/
  • Run Info vers. 1
  • Run Info vers. 2
  • RCDB

Monitoring Output Files

  • Placeholders used below: the run period 201Y-MM is, for example, 2015-03; the launch version verVV is, for example, ver15
  • Online monitoring histograms: /work/halld/online_monitoring/root/
  • Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
  • individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
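
The directory templates above can be filled in programmatically. The sketch below is only an illustration: the function name and dictionary keys are hypothetical, and it assumes the launch version is always written with two digits (as in ver00, ver02, ver15).

```python
from pathlib import PurePosixPath

# Base directories taken from the list above.
WORK_ROOT = PurePosixPath("/work/halld/data_monitoring")
VOLATILE_ROOT = PurePosixPath("/volatile/halld/offline_monitoring")


def launch_dirs(run_period, version):
    """Return the output directories for one offline monitoring launch.

    run_period is the "201Y-MM" part (e.g. "2015-03"); version is the
    launch number, rendered as a two-digit "verVV" name (an assumption).
    """
    period_dir = f"RunPeriod-{run_period}"
    version_dir = f"ver{version:02d}"
    return {
        "merged_root": WORK_ROOT / period_dir / version_dir / "rootfiles",
        "job_output": VOLATILE_ROOT / period_dir / version_dir,
    }
```

For example, launch_dirs("2015-03", 15)["merged_root"] corresponds to /work/halld/data_monitoring/RunPeriod-2015-03/ver15/rootfiles.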

Monitoring Database

  • Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring

Monitoring Webpages

SciComp Job Links

Main

Documentation

Job Tracking

Procedures: Overview

Online Monitoring: During Experimental Running

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cron job running in the counting house performs this function.

This ROOT file is processed similarly to the offline monitoring results, and is made available under the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see the Online Monitoring Shift page (https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift).

Offline Monitoring and Reconstruction: During Experimental Running

During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Incoming: Monitor the first 5 files of each newly-recorded run as soon as it hits the tape.
  2. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  3. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with. Also, during the experimental run, each run will only be fully-reconstructed once, because it will be difficult enough to keep up with the incoming data.
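
The "first 5 files of each run" selection can be sketched as follows. This is an illustration only, not the launch scripts themselves: the function name is hypothetical, and it assumes raw-data files are named hd_rawdata_<6-digit run>_<3-digit file>.evio and that "first" means the lowest file numbers.

```python
import re
from collections import defaultdict

# Assumed file-name pattern: hd_rawdata_<6-digit run>_<3-digit file>.evio
FILE_PATTERN = re.compile(r"hd_rawdata_(\d{6})_(\d{3})\.evio$")


def first_files_per_run(paths, max_files=5):
    """Group raw-data paths by run and keep the lowest-numbered files."""
    by_run = defaultdict(list)
    for path in paths:
        match = FILE_PATTERN.search(path)
        if match is None:
            continue  # ignore anything that does not look like a raw-data file
        run, file_index = int(match.group(1)), int(match.group(2))
        by_run[run].append((file_index, path))
    # Sort each run's files by file number, then truncate to max_files.
    return {
        run: [path for _, path in sorted(entries)[:max_files]]
        for run, entries in sorted(by_run.items())
    }
```

Keeping max_files as a parameter makes it easy to change the limit if the monitoring can keep up with more files per run.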

Offline Monitoring and Reconstruction: After Experimental Running

After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  2. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
  3. Further Reconstruction Launches: Every ~3 months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run, since there will be a significant amount of data.

Saving to Tape (Write-through Cache): Monitoring Launches

All job output will be directly written to the write-through cache. However, only the following will be saved to tape:

  • REST files: All files.
  • ROOT files: One merged file per run.
    • After merge, the individual files are deleted (so they won't be saved).
  • Job stdout/stderr: One tarball per run
    • After launch analysis, the log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch

Saving to Tape (Write-through Cache): Full Reconstruction Launches

  • REST files: All files.
  • ROOT files: All files, AND one merged file per run.
  • Job stdout/stderr: One tarball per run
    • After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch
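
The two retention lists above can be summarized as a small lookup table. This is a sketch for illustration; the launch-type and output-kind labels are hypothetical names chosen here, not identifiers used by the launch scripts.

```python
# Which job outputs end up on tape, per launch type, as listed above.
# The keys are illustrative labels, not names from the actual scripts.
SAVED_TO_TAPE = {
    "monitoring": {
        "rest": True,
        "root_individual": False,  # deleted after the per-run merge
        "root_merged": True,
        "log_individual": False,   # deleted after launch analysis
        "log_tarball": True,       # one tarball per run
        "png_tarball": True,       # one tarball per launch
    },
    "full_reconstruction": {
        "rest": True,
        "root_individual": True,   # all files kept, plus the merged file
        "root_merged": True,
        "log_individual": False,   # deleted once the tarball is created
        "log_tarball": True,
        "png_tarball": True,
    },
}


def is_saved_to_tape(launch_type, output_kind):
    """Return True if this kind of output is kept on tape for the launch type."""
    return SAVED_TO_TAPE[launch_type][output_kind]
```

The only difference between the two launch types in this table is whether the individual ROOT files are kept.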


Procedures: Details

On- and Offline Monitoring Data Validation

Software Tests