Data Monitoring Procedures
  
 
== Monitoring Webpages ==

* [https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
* [https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
* [https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
* [https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]

== SciComp Job Links ==

=== Main ===

* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]

=== Documentation ===

* [https://scicomp.jlab.org/docs/batch Batch System]
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
* [https://scicomp.jlab.org/docs/swif SWIF]
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]

=== Job Tracking ===

* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]

== Procedures: Overview ==

=== Online Monitoring: During Experimental Running ===

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cron job running in the counting house performs this function.

This ROOT file is processed similarly to the offline monitoring results, and the results are made available under the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift this page].

=== Offline Monitoring and Reconstruction: During Experimental Running ===

During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly-recorded run as soon as it hits the tape.
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with. Also, during the experimental run, each run will only be fully reconstructed once, because it will be difficult enough to keep up with the incoming data.

=== Offline Monitoring and Reconstruction: After Experimental Running ===

After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.

=== Saving to Tape (Write-through Cache): Monitoring Launches ===

All job output will be written directly to the write-through cache. However, only the following will be saved to tape:
* REST files: All files.
* ROOT files: One merged file per run.
** After the merge, the individual files are deleted (so they won't be saved).
* Job stdout/stderr: One tarball per run.
** After launch analysis, the log files are deleted (so they won't be saved).
* Browser png's: One tarball per launch.

=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===

* REST files: All files.
* ROOT files: All files, <span style="color:blue">AND</span> one merged file per run.
* Job stdout/stderr: One tarball per run.
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
* Browser png's: One tarball per launch.

== Procedures: Details ==

* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]

=== On- and Offline Monitoring Data Validation ===

* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
* [[Offline_Monitoring_Data_Validation_PrimEx | Offline Monitoring: Data Validation of PrimEx data]]
* [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]

== Software Tests ==

* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
** [https://halldweb.jlab.org/recon_test/ Test Results]

== Saving Online Monitoring Data ==

The procedure for writing the data out is given in, e.g., [https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].

Once the DAQ writes the data out to the raid disk, cron jobs will copy the file to tape, and within ~20 minutes we will have access to the file on tape at /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.

All online monitoring plugins will be run as data is taken. They will be accessible within the counting house via RootSpy, and for each run and file, a ROOT file containing the histograms will be saved within a subdirectory for each run.

For immediate access to these files, the raid-disk files may be accessed directly from the counting house, or the tape files will be available within ~20 minutes of the file being written out.

== Explanation of Scripts ==

Below are explanations of each script used in the offline monitoring system and a brief explanation of how they work.

=== hdswif scripts ===

<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, the script will add it to PYTHONPATH; if ROOTSYS is not set, the scripts will abort.

{| border="1" cellpadding="0" valign="left" style="text-align: left;"
!width="150"| file name
!width="800"| Description
|-
| hdswif.py ||
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)
|-
| parse_swif.py ||
Called within hdswif.py to create HTML output from SWIF results.
|-
| createXMLfiles.py ||
* Called by the "create" option of hdswif.py
* Creates XML files for logging information about the launch. Must specify a config file with option -c.
* Also adds tags to the git repositories of sim-recon and hdds.
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files go to the current directory.
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked to confirm that all version numbers have been extracted.</b>
|-
| read_config.py ||
* Called by the "add" option of hdswif.py
* Takes in the config file name; verbosity can optionally be set
* Returns a dictionary between config parameter names and values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)
* Prints the config parameters at the end. Parameters changed from their defaults are marked with a '*'
|-
| output_job_details.py ||
* Called by the "details" option of hdswif.py
* Takes in workflow name, run, and file. Run and file must be numbers, no wildcards
* Finds the IDs for jobs specified by the run and file number and returns info on each one
* Job info is retrieved from the PBS farm system and shows the configuration parameters for that job
|-
| results_by_resources.py ||
* Called within parse_swif.py
* Takes in XML output from SWIF and creates 2 plots
* Creates an HTML table showing the results of jobs by the resources requested for each job
* This table is shown under "Status by Resources" in the output HTML file of hdswif.py summary [workflow]
|-
| create_ordered_hists.py ||
* Called within parse_swif.py
* Takes in XML output from SWIF and creates 2 plots
* Plots are dependency time and pending time of each job, ordered by submission.
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.
|-
| create_stacked_times.py ||
* Called within parse_swif.py
* Takes in XML output from SWIF and creates 2 plots
* Plots show the total job time divided into colors for the different stages
* One shows all jobs in order of Auger ID (roughly submission order); the other shows them in order of total job time
* The plots show at a glance which stage contributes how much of each job's time
|}
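
As a rough sketch, a typical hdswif.py session might look like the following. The workflow name, config file, and run/file numbers are placeholders, only the subcommands described in the table above are used, and the exact option flags should be checked against the script's own usage message:
<pre>
# create the workflow and its XML log files; -c specifies the config file
hdswif.py create offmon_2015_03_ver15 -c hdswif.cfg

# add jobs to the workflow from the same config file
hdswif.py add offmon_2015_03_ver15 -c hdswif.cfg

# inspect the jobs for one run/file (numbers are placeholders, no wildcards)
hdswif.py details offmon_2015_03_ver15 003185 004

# produce the HTML summary of the launch
hdswif.py summary offmon_2015_03_ver15

# afterwards, confirm that all package versions were extracted into the XML record
cat /group/halld/data_monitoring/run_conditions/soft_comm_[RUNPERIOD]_ver[VER].xml
</pre>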

=== Utility scripts ===

{| border="1" cellpadding="0" valign="left" style="text-align: left;"
!width="150"| file name
!width="800"| Description
|-
| stderr_by_size.py ||
* Independent of all other scripts in the hdswif directory
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of the stderr files
* Takes in run period and version as arguments, and creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.
|}
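
For example, to group a launch's log files by stderr size (the run period and version below are placeholders, and the exact argument format should be checked in the script itself):
<pre>
python stderr_by_size.py 2015-03 15
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize
</pre>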

=== cross_analysis scripts ===

<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running it with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script be run by hand to catch any errors.

{| border="1" cellpadding="0" valign="left" style="text-align: left;"
!width="150"| file name
!width="800"| Description
|-
| run_cross_analysis.sh ||
* Main script to call the other Python scripts.
* Takes in run period and version as arguments and runs the cross analysis
|-
| create_cross_analysis_table.sh ||
* Creates the MySQL table for the current launch that is used by the other scripts
* Takes run period and version as input, and creates the table cross_analysis_table_[RUNPERIOD]_ver[VERSION]
|-
| create_stats_table_row.py ||
* Creates a row in the HTML table showing the overall statistics of final states and problems for a launch
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job, they are counted.
* The script creates a new HTML table row for the current launch. This HTML snippet is inserted into the web-accessible file showing the results
|-
| create_stats_for_each_file.py ||
* Creates HTML tables showing the status of each file against different launch versions.
* Takes in run period, min version, and max version, and shows the final result and problems for all versions in between
* Different final results and problems are shown by combinations of text content, color coding of text, and background color
|-
| create_resource_correlation_plots.py ||
* Creates plots showing the correlation of resources between different launches for each file.
* Takes in run period, min version, and version of interest. Points are shown only for files included in the version of interest
* Creates plots for CPU time, wall time, memory, virtual memory, #events, difference in #events, time to copy the input EVIO file, and time to run the plugin
|}
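
For a hypothetical run period and version, the cross analysis could be started as follows; as recommended above, it is safer to open run_cross_analysis.sh and execute its commands one at a time, checking each step before moving on:
<pre>
./run_cross_analysis.sh 2015-03 15
</pre>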

== Running Over Data As It Comes In ==

A special user, gxproj1, will have a cron job set up to run the plugins as new data appears on /mss. During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs run the previous Friday. The procedure for this is shown below.

<!--
=== Setting up the environment ===

The file /home/gxproj1/setup_jlab.csh is sourced through .tcshrc. This file is the same as what is linked to by /home/gluex/setup_jlab_commissioning.csh, except that HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this user can have a separate build.

To obtain the builds from the previous Friday's runs, execute
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]
The build revisions from the previous Friday are archived in files
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml
and the script will build libraries based on those stored revision numbers.
-->

=== Running the cron job ===

'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job.

* Go to the cron job directory:
<pre>
cd /u/home/gxproj1/halld/monitoring/newruns
</pre>

* The cron_plugins file is the crontab that will be executed.  During execution, it runs the exec.sh command in the same folder.  This command takes two arguments: the project name, and the maximum file number for each run.  These fields should be updated in the cron_plugins file before running.

* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number.  It then submits jobs for these files.

* To start the cron job, run:
<pre>
crontab cron_plugins
</pre>

* To check whether the cron job is running, do:
<pre>
crontab -l
</pre>

* To remove the cron job, do:
<pre>
crontab -r
</pre>
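
For orientation, a line inside the cron_plugins crontab might look like the following sketch. The schedule, project name, and log file here are hypothetical; the actual cron_plugins file in the directory above is the authoritative version:
<pre>
# hypothetical: run exec.sh hourly for project "offmon_2015_03", submitting files up to number 004
0 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offmon_2015_03 004 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1
</pre>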
 

<!--
The cron job will run the script scan_for_jobs.sh, which runs generatejobs_plugins_rawdata.sh for any new runs that it had not seen before. All previously seen runs are recorded in the file filelists/files_current.txt, so clear this file to run over those runs again, or set the parameters MINRUN and MAXRUN, which set the range of runs submitted.
-->
 

== Post-Processing Procedures ==

To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on the monitoring web pages. The results from the different raw data files in a run are also combined into single ROOT and REST files. This section describes how to generate the monitoring images and database information.

The post-processing scripts generally perform the following steps for each run:

# Summarize the monitoring information from each EVIO file and store this information in a database
# Merge the monitoring ROOT files into a single file for the run
# Generate summary monitoring information for the run and store it in a database
# Generate summary monitoring plots and store these in a web-accessible location
# Merge the REST files generated by the monitoring jobs into a single file for each run

The scripts used to generate this summary data are primarily run from /home/gxprojN/monitoring/process, i.e. the same account from which the monitoring launch was performed.  If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:
<syntaxhighlight>
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process
</syntaxhighlight>

Note that these scripts depend on the standard GlueX environment definitions to load the Python modules needed to access MySQL databases.
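
In practice this means sourcing the standard setup script for the launch account before running any of these scripts by hand, for example (which file applies depends on how the account is set up, as noted in the step-by-step instructions below):
<pre>
source $HOME/setup_jlab.csh
# or, on accounts using the newer environment file:
source $HOME/env_monitoring_launch
</pre>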
 

=== Online Monitoring ===

There are two primary scripts for running over the monitoring data generated by the online system. The online script can be run with either of the following commands:
<syntaxhighlight>
/home/gluex/halld/monitoring/process/check_new_runs.py

OR

/home/gluex/halld/monitoring/process/check_new_runs.csh
</syntaxhighlight>
The shell script is appropriate for use in a cron job.  The cron job is currently run under the "gluex" account.

The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house.  The Python script above automatically checks for new ROOT files and then processes them.  It contains several configuration variables that must be set correctly, giving the locations of the input/output directories, etc.  Currently it loads run meta-info from the run conditions text file that is also copied by the online system; this may change when the RCDB is fully online.

'''IMPORTANT''' - When a new run period is started, a new data version must be created, and the scripts updated to reflect the new run period.  You may want to update the run number range to scan as well.
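
Since check_new_runs.csh is intended to be run from cron, the corresponding crontab entry might look like the following sketch (the schedule and log path are hypothetical; only the script path is taken from above):
<pre>
*/10 * * * * /home/gluex/halld/monitoring/process/check_new_runs.csh >> /home/gluex/halld/monitoring/process/check_new_runs.log 2>&1
</pre>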
 

=== Offline Monitoring ===

After the data is run over, the results should be processed so that summary data is entered into the monitoring database and plots are made for the monitoring webpages.  Currently, this processing is controlled by a cron job that runs the following script:
<syntaxhighlight>
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh
</syntaxhighlight>
By default, this script checks for new ROOT files and only runs over those it hasn't processed yet.  Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT and REST files for that run are combined into single files.  Information is stored in the database on a per-file basis and for the whole run.

This procedure has many options, and many of these steps can be toggled on and off.  Look at the output of "process_new_offline_data.py -h" for more information.

Plots for the monitoring web pages can be made from single histograms or from multiple histograms using RootSpy macros.  If you want to change the list of plots made, you must modify one of the following files:
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path
* macros_to_monitor - specify the full path to the RootSpy macro .C file

Note that the most time-consuming parts of this process are merging the ROOT and REST files.
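
For illustration, entries in these two files might look like the following; the histogram names and macro path are made up and should be replaced with the real objects produced by the monitoring plugins and RootSpy macros:
<pre>
# histograms_to_monitor (one histogram name or full ROOT path per line; annotation lines like this one are not part of the file):
PS_E
/PSPair/PSC_PS/PS_E

# macros_to_monitor (one full macro path per line):
/home/gxproj1/monitoring/process/macros/CDC_occupancy.C
</pre>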
 

=== Step-by-Step Instructions For Processing a New Offline Monitoring Run ===

The monitoring launches are currently run out of the gxproj1 and gxproj5 accounts.  After an offline monitoring launch has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.

# The post-processing scripts are stored in $HOME/monitoring/process and are automatically run by cron.
# Run "svn update" to bring in any changes.  Be sure that the lists of histograms and macros to plot are current.
# Add a new data version [as described below].
# Edit check_monitoring_data.csh to point to the current revisions/directories:
#* RUNPERIOD
#* VERSION
#* ARGS
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh or $HOME/env_monitoring_launch
# Update the files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata
# The current policy is to keep the REST files on the volatile disk and allow them to be deleted according to that disk's cleanup policy. The latest version of the files should always be available. The REST files can also be copied to more permanent locations:
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV  [under testing]

Check the log files in $HOME/monitoring/process/log for more information on how each run went.  If there are problems, check the log files and modify check_monitoring_data.csh to vary the verbosity of the output.
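
Step 3 above ("Add a new data version") uses the register_new_version.py script described in the next section. A sketch of the invocation is shown below; the input file name is hypothetical and the exact argument form should be checked against the script itself:
<pre>
cd $HOME/monitoring/process
./register_new_version.py version_recon_2015_03.txt
</pre>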
 

== Data Versions ==

To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis, we save several pieces of information.  The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.

We store one record per pass through one run period, with the following structure:

{| class="wikitable"
! Field !! Description
|-
| data_type || The level of data we are processing.  For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
|-
| run_period || The run period of the data
|-
| revision || An integer specifying which pass through the run period this data corresponds to
|-
| software_version || The name of the XML file that specifies the different software versions used
|-
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program
|-
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
|-
| production_time || The date at which monitoring/reconstruction began
|-
| dataVersionString || A convenient string for identifying this version of the data
|}

An example file used as input to ./register_new_version.py is:
<syntaxhighlight>
data_type           = recon
run_period          = RunPeriod-2014-10
revision            = 1
software_version    = soft_comm_2014_11_06.xml
jana_config         = jana_rawdata_comm_2014_11_06.conf
ccdb_context        = calibtime=2014-11-10
production_time     = 2014-11-10
dataVersionString   = recon_RunPeriod-2014-10_20141110_ver01
</syntaxhighlight>
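
To check which data versions have already been registered, the data_monitoring database (connection details are in the Master List section below) can be queried directly. The table name in this sketch is a guess, so list the tables first; the column names are the fields documented above:
<pre>
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "SHOW TABLES;"
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "SELECT data_type, run_period, revision, dataVersionString FROM version_info ORDER BY production_time DESC LIMIT 5;"
</pre>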

== Master List of File / Database / Webpage Locations ==

=== Run Conditions ===

* Online run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
* Offline monitoring run conditions (software versions, JANA config): /group/halld/data_monitoring/run_conditions/
* Run Info vers. 1
* Run Info vers. 2
* RCDB

=== Monitoring Output Files ===

* Run periods are written as 201Y-MM (for example, 2015-03) and launch versions as verVV (for example, ver15).
* Online monitoring histograms: /work/halld/online_monitoring/root/
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
* Individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/

=== Monitoring Database ===

* Accessing the monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring
