Difference between revisions of "Data Monitoring Procedures"

 
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]
*[https://halldweb.jlab.org/rcdb RCDB]
  
 
=== Monitoring Output Files ===
* In the paths below, 201Y-MM stands for the run period (for example, 2015-03) and verVV for the launch version (for example, ver15).
* Online monitoring histograms: /work/halld/online_monitoring/root/
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
+
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV
+
* individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
+
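The path conventions listed above can be captured in a small helper. This is only an illustrative sketch: the function name <code>monitoring_paths</code> and the dictionary keys are hypothetical and are not part of the monitoring scripts; only the directory layout is taken from the list above, and the two-digit zero padding of the version is assumed from examples such as ver08 and ver15.

```python
def monitoring_paths(run_period, version):
    """Return the monitoring output locations for one launch.

    run_period: string such as "2015-03"
    version:    launch version as an integer, e.g. 15
    """
    period = "RunPeriod-%s" % run_period
    ver = "ver%02d" % version  # assumed two-digit padding (ver08, ver15, ...)
    return {
        # Online monitoring histograms (not split by launch version)
        "online_root": "/work/halld/online_monitoring/root/",
        # Merged offline monitoring ROOT files
        "merged_root": "/work/halld/data_monitoring/%s/%s/rootfiles" % (period, ver),
        # Individual per-job files (ROOT, REST, log, ...)
        "job_files": "/volatile/halld/offline_monitoring/%s/%s/" % (period, ver),
    }
```

For example, <code>monitoring_paths("2015-03", 15)["merged_root"]</code> gives the merged ROOT file directory for launch ver15 of run period 2015-03.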
  
 
=== Monitoring Database ===
 
  
 
=== Monitoring Webpages ===
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]
+
*[https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
 +
*[https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]
  
== Job Monitoring Links ==
+
== SciComp Job Links ==
 +
=== Main ===
 +
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
 +
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
 +
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]
 +
 
 +
=== Documentation ===
 +
* [https://scicomp.jlab.org/docs/batch Batch System]
 +
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
 +
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
 +
* [https://scicomp.jlab.org/docs/swif SWIF]
 +
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]
 +
 
 +
=== Job Tracking ===
 
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]
  
== Saving Online Monitoring Data ==
+
== Procedures: Overview ==
 
+
The procedure for writing the data out is described in, e.g., [https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].

Once the DAQ writes the data to the raid disk, cron jobs copy each file to tape, and within ~20 min. the file is accessible on tape at /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.

All online monitoring plugins are run as data is taken. Their histograms are accessible within the counting house via RootSpy, and for each run and file, a ROOT file containing the histograms is saved within a subdirectory for that run.

For immediate access to these files, the raid disk files may be accessed directly from the counting house; otherwise, the tape files become available within ~20 min. of the file being written out.
+
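The tape location given above can be built programmatically. The sketch below is illustrative only: <code>tape_run_directory</code> is a hypothetical name, and the zero padding of the run number to six digits is an assumption read off the "RunXXXXXX" pattern. The value of $RUN_PERIOD is passed in unchanged, since its exact form is not specified here.

```python
def tape_run_directory(run_period, run_number):
    """Build the tape directory for one run.

    run_period: the value of $RUN_PERIOD, used as given
    run_number: integer run number, zero-padded to six digits (assumed)
    """
    return "/mss/halld/%s/rawdata/Run%06d" % (run_period, run_number)
```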
 
+
== Offline Monitoring: Running Over Archived Data ==
+
 
+
Once files are written to tape, we run the online plugins on these files to confirm what we saw in the online monitoring, and to update the results using the latest calibrations and software.
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.

Every other Friday (usually the Friday before the offline meetings), jobs are started to run the newest software over all previous runs, which allows everybody to see improvements in each detector.
For each launch, independent builds of hdds, sim-recon, and the monitoring plugins are made, and an sqlite file is generated.
+
 
+
The procedures described below cover:
# Preparing the software for the launch
# Starting the launch (using hdswif)
# Post-analysis of statistics of the launch

Processing the results and making them available to the collaboration is covered in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.
+
 
+
<!--
+
==== Generating an offline plugin job ====
+
 
+
The user gxproj1 should be used for official offline monitoring jobs.
+
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.
+
The main script is generatejobs_plugins_rawdata.sh, which can be used as
+
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)
+
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify
+
the file range for that run.
+
 
+
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.
+
Executing this script will send the monitoring plugins job to the Auger batch system.
+
 
+
There is also a script clean.sh which can be used as
+
clean.sh XXX
+
This will clean up all associated files created in association with run XXX.
+
 
+
Internally, the xml file used to submit the job will be created, and the job to run
+
will be given within script.sh. All run parameters should be specified in at the beginning
+
of generatejobs_plugins_rawdata.sh
+
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run
+
over this cached file.
+
 
+
==== Using cron to run automatically ====
+
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins
+
that can be executed via
+
crontab cron_plugins
+
This will set up a cron job to call the script scan_for_jobs.sh, which will
+
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for
+
any run that is more than 5 min old. The cron job is set up to run every 10 min.
+
 
+
==== Magnetic Field Settings ====
+
For tracking, it is necessary to set the correct magnetic field settings.
+
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.
+
 
+
The actual field settings for each run have not been documented well.
+
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].
+
 
+
{| border="1" cellpadding="1" style="text-align: center;"
+
!width="100"| Run #s
+
!width="150"| Solenoid Current (A)
+
!width="300"| JANA option
+
!width="300"| notes
+
|-
+
| 940 - 996  || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104  || At run 997, the solenoid started ramping down
+
|-
+
| 998 - 1448 ||    0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.
+
|-
+
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520  ||
+
|-
+
|}
+
 
+
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not:  1036  - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].
+
-->
+
 
+
=== Procedures ===
+
This section explains how the offline monitoring should be run.
+
Since we may want to simultaneously run offline monitoring for different run periods that require
+
different environment variables, the scripts are set up so that a generic user can download the
+
scripts and run them from anywhere.
+
 
+
To do this, you will check out a directory from svn that contains all the necessary scripts,
+
and running a generation script will generate all the necessary files for the present launch.
+
The main engine behind keeping track of all the files submitted and their status is the jproj
+
system created by Mark Ito.
+
 
+
Below is the process of creating and running a launch.
+
 
+
 
+
# To stop the incoming-data cron job, first list the current cron jobs with <pre>crontab -l</pre> and then remove them with <pre>crontab -r</pre> Also delete all jobs that have not started yet. If such jobs are left alive, it becomes unclear which software version they ran with.
+
# svn update & rebuild HDDS with <pre>cd $HDDS_HOME</pre><pre> svn up</pre><pre> scons install</pre>
+
# svn update & rebuild sim-recon with <pre>cd $HALLD_HOME</pre> <pre> svn up</pre> <pre>cd src</pre> <pre>scons -c install</pre> <pre>scons install</pre> The "scons -c install" is necessary to clean out any old header files. NOTE THAT IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT AND CHECK OUT A NEW VERSION SO THAT STALE HEADER FILES ARE NOT PRESENT.
+
# svn update & rebuild monitoring plugins with <pre>cd /home/gxproj1/builds/online/packages/SBMS</pre> <pre> svn up</pre> <pre>cd /home/gxproj1/builds/online/packages/monitoring/src/plugins</pre> <pre> svn up</pre> <pre>scons -u install</pre>
+
# Prepare the latest sqlite file with: <pre>cd $HOME/builds/</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre>
+
# Note that the above steps must be done BEFORE launch project creation. This is because we will track the svn revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.
+
# Create the appropriate project(s) and submit the jobs using the Hall D Job Management System, as detailed in the section below.
+
# Restart cron jobs for immediate processing of runs coming in.
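Because the launch bookkeeping relies on the svn revision of each build directory, it can help to see how a revision number is pulled out of <code>svn info</code> output. The following is a minimal sketch, not the code used by the launch scripts: <code>svn_revision</code> is a hypothetical helper, and it only assumes that the output contains a line of the form "Revision: N", as the standard svn client prints.

```python
import re


def svn_revision(svn_info_text):
    """Return the revision number found in `svn info` output, or None."""
    match = re.search(r"^Revision:\s*(\d+)\s*$", svn_info_text, re.MULTILINE)
    return int(match.group(1)) if match else None
```

In practice the text would come from running <code>svn info</code> inside each build directory (for example $HDDS_HOME or $HALLD_HOME) before the project is created.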
+
 
+
== Hall D Job Management System ==
+
 
+
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito.  These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. 
+
 
+
=== Database Table Overview ===
+
 
+
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields.
+
 
+
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others.
+
 
+
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.
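The naming scheme of the three tables can be summarized in a few lines. This is an illustrative sketch: the function <code>project_tables</code> and its dictionary keys are hypothetical, while the suffixes "Job" and "_aux" are the ones listed above.

```python
def project_tables(project_name):
    """Return the names of the three database tables for a project."""
    return {
        "management": project_name,         # one row per input file
        "status": project_name + "Job",     # one row per job (status, times, memory)
        "metrics": project_name + "_aux",   # one row per job (events, copy/run time)
    }
```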
+
 
+
=== Initialize Project Management ===
+
 
+
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.
+
<pre>
+
ssh gxproj1@ifarm -Y
+
</pre>
+
 
+
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/
+
 
+
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre>
+
 
+
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. The project name is assumed to be of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata; this string is parsed to give the run period 20YY_MM and the version number VV. BEFORE creating a new project, edit the conditions of the launch (plugins to run over, memory requested, disk space requested) in templates/template.jsub . This information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions . To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre> The name has been chosen to be as consistent as possible with other directory structures. However, mysql requires that "-" be escaped in table names, so run periods are given as 20YY_MM instead of 20YY-MM.
+
 
+
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.
+
 
+
* For each project, cd into the new directory
+
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do
+
<pre>./clear.sh</pre>
+
* To use the jproj.pl script that was checked in, add the directory to your path with
+
<pre>source ../../scripts/setup.csh</pre>
+
or always specify the full path
+
<pre>../../scripts/jproj.pl</pre>
+
* Now update the table of runs with
+
<pre>jproj.pl <project name> update</pre>
+
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.
+
* Once you have registered all of the files you would like to run over, do
+
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre>
+
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.
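The project-name convention used in the steps above can be checked before a project is created. The sketch below is an assumption-laden illustration, not the parsing done by create_project.sh: <code>parse_project_name</code> is a hypothetical helper that accepts only names of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata and returns the run period in its hyphenated directory form together with the version number.

```python
import re

# Pattern for offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata
PROJECT_RE = re.compile(
    r"^offline_monitoring_RunPeriod(\d{4})_(\d{2})_ver(\d+)_hd_rawdata$"
)


def parse_project_name(name):
    """Split a project name into (run period "20YY-MM", integer version)."""
    match = PROJECT_RE.match(name)
    if match is None:
        raise ValueError("unexpected project name: %r" % name)
    year, month, version = match.groups()
    return "%s-%s" % (year, month), int(version)
```

Rejecting malformed names early avoids creating database tables whose names cannot later be mapped back to a run period and version.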
+
 
+
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs first to check that all scripts are working and that the plugins do not crash. Once you are sure of this, you can submit all jobs. The remaining tasks are then the monitoring post-processing, which (among other things) puts the results on the webpage for the collaboration to view, and the analysis of the launch.
+
 
+
=== Project File Overview ===
+
 
+
An overview of each project file:
+
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones.
+
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs.
+
* '''<project_name>.jsub:''' The xml job submission script.  The run number and file number variables are set during job submission for each input file.
+
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:
+
<pre>
+
mkdir -p -m 775 my_directory
+
</pre>
+
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution.
+
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.
+
 
+
=== Project Management ===
+
 
+
* Delete (if any) and create the database table(s) for the current set of job submissions:
+
<pre>
+
./clear.sh
+
</pre>
+
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:
+
<pre>
+
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf
+
</pre>
+
 
+
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>).  You can test by adding an optional argument at the end, which only selects files with a specific file number:
+
<pre>
+
jproj.pl <project_name> update <optional_file_number>
+
</pre>
+
 
+
* Confirm that the job management database is accurate by printing its contents to screen:
+
<pre>
+
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"
+
</pre>
+
 
+
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run:
+
<pre>
+
./clear.sh
+
</pre>
+
 
+
* To look at the status of the submitted jobs, first query auger and update the job status database:
+
<pre>
+
fill_in_job_details.pl <project_name>
+
</pre>
+
 
+
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):
+
<pre>
+
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"
+
</pre>
+
 
+
* These last two commands can instead be executed simultaneously by running:
+
<pre>
+
./status.sh
+
</pre>
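When the status query shown above needs to be issued from a script, the command line can be assembled as an argument list. This is a sketch under stated assumptions: <code>status_query_command</code> is a hypothetical helper, and the host, user, database, and column names are simply copied from the mysql command above. It only builds the command and does not run it.

```python
# Columns copied from the job status query shown above
STATUS_COLUMNS = [
    "id", "run", "file", "jobId", "hostname", "status", "timeSubmitted",
    "timeActive", "walltime", "cput", "timeComplete", "result", "error",
]


def status_query_command(project_name):
    """Return the mysql argument list that queries <project_name>Job."""
    query = "select %s from %sJob" % (",".join(STATUS_COLUMNS), project_name)
    return ["mysql", "-hhallddb", "-ufarmer", "farming", "-e", query]
```

The resulting list can be handed to <code>subprocess.call</code> without shell quoting issues, since the query is passed as a single argument.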
+
 
+
=== Handy mysql Instructions ===
+
 
+
* Handy mysql instructions:
+
<pre>
+
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"
+
quit; # Exit mysql
+
show tables; # Show a list of the tables in the current database
+
show columns from <project_name>; # show all of the columns for the given table
+
select * from <project_name>; # show the contents of all rows from the given table
+
</pre>
+
 
+
=== Backing Up Offline Monitoring Tables ===
+
Tables created for offline monitoring can be backed up using the script backup_tables.sh which
+
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects
+
 
+
The script uses the command mysqldump to write out a file that can be executed to recreate the tables.
Since executing this output file will first drop any table that already exists, caution is advised.
Example usage to back up all three tables created for run period 2014_10 ver 17:
+
<pre>backup_tables.sh 2014_10 17</pre>
+
 
+
== Running Over Data As It Comes In ==
+
 
+
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.
+
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs
+
run the previous Friday. The procedure for this is shown below.
+
 
+
<!--
+
=== Setting up the environment ===
+
The file
+
/home/gxproj1/setup_jlab.csh
+
is sourced through .tcshrc.
+
This file is the same as what is linked to by
+
/home/gluex/setup_jlab_commissioning.csh,
+
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this
+
user can have a separate build.
+
 
+
To obtain the builds from the previous Friday's runs,
+
execute
+
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]
+
The build revisions from the previous Friday are archived in files
+
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml
+
and the script will build libraries based on those stored revision numbers.
+
-->
+
 
+
=== Running the cron job ===
+
 
+
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs with the jproj.pl script for the same project; otherwise the same job will probably be submitted more than once.
+
 
+
* Go to the cron job directory:
+
<pre>
+
cd /u/home/gxproj1/halld/monitoring/newruns
+
</pre>
+
 
+
* The cron_plugins file is the cronjob that will be executed.  During execution, it runs the exec.sh command in the same folder.  This command takes two arguments: the project name, and the maximum file number for each run.  These fields should be updated in the cron_plugins file before running.
+
 
+
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number.  It then submits jobs for these files. 
+
 
+
* To start the cron job, run:
+
<pre>
+
crontab cron_plugins
+
</pre>
+
 
+
* To check whether the cron job is running, do
+
<pre>
+
crontab -l
+
</pre>
+
 
+
* To remove the cron job do
+
<pre>
+
crontab -r
+
</pre>
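The file-number cut applied by exec.sh, described in the list above, can be illustrated as follows. This is a sketch, not the logic of exec.sh itself: <code>select_files</code> is a hypothetical helper, and the file-name pattern hd_rawdata_RRRRRR_FFF.evio (run number, then file number) is an assumption about how the raw data files are named.

```python
import re

# Assumed raw data file name pattern: hd_rawdata_<run>_<file>.evio
RAW_FILE_RE = re.compile(r"hd_rawdata_(\d+)_(\d+)\.evio$")


def select_files(filenames, max_file_number):
    """Keep raw data files whose file number does not exceed the maximum."""
    selected = []
    for name in filenames:
        match = RAW_FILE_RE.search(name)
        # Names that do not match the pattern are ignored entirely
        if match and int(match.group(2)) <= max_file_number:
            selected.append(name)
    return selected
```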
+
 
+
<!--
+
The cron job will run the script scan_for_jobs.sh,
+
which runs generatejobs_plugins_rawdata.sh for any
+
new runs that it had not seen before. All previous
+
runs are recorded in the file filelists/files_current.txt
+
so clear this to run over runs, or set the parameters
+
MINRUN and MAXRUN which will set the range of runs submitted.
+
-->
+
 
+
==Post-Processing Procedures==
+
 
+
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page.  This section describes how to generate the monitoring images and database information.
+
 
+
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process .  If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:
+
<syntaxhighlight>
+
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process
+
</syntaxhighlight>
+
Note that these scripts currently have some parameters which must be periodically set by hand.
+
 
+
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database.  To run these scripts, load the environment with the following command:
+
<syntaxhighlight>
+
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh
+
</syntaxhighlight>
+
 
+
===Online Monitoring===
+
 
+
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:
+
<syntaxhighlight>
+
/home/gluex/halld/monitoring/process/check_new_runs.py
+
 
+
OR
+
 
+
/home/gluex/halld/monitoring/process/check_new_runs.csh
+
</syntaxhighlight>
+
The shell script sets up the environment properly to run the python script.  To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed.  The shell script is appropriate to use in a cron job.  The cronjob is currently run under the "gluex" account.
+
 
+
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house.  The python script automatically checks for new ROOT files and processes them.  It contains several configuration variables that must be set correctly, such as the locations of the input and output directories.
+
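The "check for new ROOT files" step can be pictured as a set difference between what is on disk and what has already been handled. The sketch below is only an illustration: <code>find_new_root_files</code> is a hypothetical helper, and how the real script records the files it has already processed is not described here.

```python
def find_new_root_files(available, processed):
    """Return ROOT files that are available but not yet processed, sorted."""
    seen = set(processed)
    return sorted(
        name for name in available
        if name.endswith(".root") and name not in seen
    )
```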
  
===Offline Monitoring===
+
=== Online Monitoring: During Experimental Running ===
  
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages.  Currently, this processing is controlled by a cronjob that runs the following script:
+
After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring . A cronjob running in the counting house performs this function.
<syntaxhighlight>
+
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh 
+
</syntaxhighlight>
+
This script checks for new ROOT files, and only runs over those it hasn't processed yet.  Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis.
+
  
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros.  If you want to change the list of plots made, you must modify one of the following files:
+
This ROOT file is processed similarly to the offline monitoring results, and the results are made available under the same webpages as "ver00" of the relevant run period.
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path
+
* macros_to_monitor - specify the full path to the RootSpy macro .C file
+
  
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:
+
For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift this page].
# Add a new data version, as described below:
+
# Change the following parameters in check_monitoring_data.csh:
+
## JOBDATE should correspond to the output date used by the job submission script
## OUTPUTDIR should be the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.
## Once you create a new data version as described below, pass the needed information as a command line option.  Currently this is done through the ARGS variable.  For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.
+
  
<syntaxhighlight>
+
=== Offline Monitoring and Reconstruction: During Experimental Running ===
Example configuration parameters:
+
set JOBDATE=2015-01-09
+
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring
+
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08
+
set ARGS=" -v RunPeriod-2014-10,8 "
+
</syntaxhighlight>
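The value passed with "-v" in ARGS combines a run period and a revision, separated by a comma. A minimal sketch of how such a value can be split is given below; <code>parse_version_argument</code> is a hypothetical helper and is not taken from the monitoring scripts.

```python
def parse_version_argument(value):
    """Split a value such as "RunPeriod-2014-10,8" into (run period, revision)."""
    run_period, revision = value.rsplit(",", 1)
    return run_period.strip(), int(revision)
```

For the example configuration above, the value "RunPeriod-2014-10,8" splits into the run period "RunPeriod-2014-10" and the integer revision 8.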
+
If you want to process the results manually, the data is processed using the following script:
+
<syntaxhighlight>
+
./process_new_offline_data.py <input directory> <output directory>
+
  
EXAMPLE:
+
During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:  
  
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02
+
# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly-recorded run as soon as it hits the tape.
</syntaxhighlight>
+
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.  
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.
+
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
  
Every time a new reconstruction pass is performed, a new version number must be generated.  To do this, prepare a version file as described below.  Then run the register_new_version.py script to store the information in the database.  The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure.  An example of how to generate a new version is:
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with.  Also, during the experimental run, each run will be fully reconstructed only once, because it will be difficult enough to keep up with the incoming data.
<syntaxhighlight>
+
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt
+
</syntaxhighlight>
+
  
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/
+
=== Offline Monitoring and Reconstruction: After Experimental Running ===
of the svn repository</b>, and created a project with
+
create_project.sh [project name] hd_rawdata
+
Then go to the directory [project name]/processing/
+
and execute
+
./run_processing.sh
+
which will run register_new_version.py  as well as check_monitoring_data.csh for that project.
+
  
===Step-by-Step Instructions For Processing a New Monitoring Run===
+
After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
  
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.
+
# '''Monitoring Launches:''' Every two weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
 +
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
 +
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data. 
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
  
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.
# Run "svn update" to bring any changes in.  Be sure that the list of histograms and macros to plot are current.
+
# Edit check_monitoring_data.csh to point to the current revisions/directories
+
#* VERSION
+
#* ARGS
+
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh
+
# Update files in the web directory, so that the results are displayed on the web pages:  /group/halld/www/halldweb/html/data_monitoring/textdata
+
  
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to adjust the verbosity of the output.
+
=== Saving to Tape (Write-through Cache): Monitoring Launches ===
 +
All job output will be directly written to the write-through cache. However, only the following will be saved to tape:
 +
* REST files: All files.
 +
* ROOT files: One merged file per run.  
 +
** After merge, the individual files are deleted (so they won't be saved).  
 +
* Job stdout/stderr: One tarball per run
 +
** After launch analysis, the log files are deleted (so they won't be saved).  
 +
* Browser png's: One tarball per launch
  
==Data Versions==
+
=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===
 +
* REST files: All files.
 +
* ROOT files: All files, <span style="color:blue">AND</span> one merged file per run.
 +
* Job stdout/stderr: One tarball per run
 +
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
 +
* Browser png's: One tarball per launch
  
To document the conditions under which monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis.  The format is intended to be comprehensive enough to document not just monitoring data but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.
 
  
We store one record per pass through one run period, with the following structure:
+
== Procedures: Details ==
  
{| class="wikitable"
+
* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
! Field !! Description
+
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
|-
+
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
| data_type || The level of data we are processing.  For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
+
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
|-  
+
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
| run_period || The run period of the data
+
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]
|-
+
| revision || An integer specifying which pass through the run period this data corresponds to
+
|-
+
| software_version || The name of the XML file that specifies the different software versions used
+
|-
+
| jana_config  || The name of the text file that specifies which JANA options were passed to the reconstruction program
+
|-
+
| ccdb_context  || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
+
|-
+
| production_time  || The date at which monitoring/reconstruction began
+
|-
+
| dataVersionString  || A convenient string for identifying this version of the data
+
|}
+
  
 +
=== On- and Offline Monitoring Data Validation===
 +
* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
 +
* [[Offline_Monitoring_Data_Validation_PrimEx | Offline Monitoring: Data Validation of PrimEx data]]
 +
* [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]
  
An example file used as input to ./register_new_version.py is:
+
== Software Tests ==
<syntaxhighlight>
+
* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
data_type          = recon
+
** [https://halldweb.jlab.org/recon_test/ Test Results]
run_period          = RunPeriod-2014-10
+
revision            = 1
+
software_version    = soft_comm_2014_11_06.xml
+
jana_config        = jana_rawdata_comm_2014_11_06.conf
+
ccdb_context        = calibtime=2014-11-10
+
production_time    = 2014-11-10
+
dataVersionString  = recon_RunPeriod-2014-10_20141110_ver01
+
</syntaxhighlight>
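The example file above is a list of "key = value" lines. As a minimal sketch (not the actual logic of register_new_version.py, which is not shown here), such a file could be read into a dictionary as follows. The function name parse_version_file is hypothetical. Note that the ccdb_context value itself contains an "=", so each line is split only on its first "=".

```python
def parse_version_file(text):
    """Parse 'key = value' lines into a dict of strings (illustrative only)."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a key/value pair.
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values such as
        # 'calibtime=2014-11-10' are kept intact.
        key, value = line.split("=", 1)
        record[key.strip()] = value.strip()
    return record


example = """\
data_type          = recon
run_period         = RunPeriod-2014-10
revision           = 1
ccdb_context       = calibtime=2014-11-10
"""

record = parse_version_file(example)
print(record["run_period"])    # RunPeriod-2014-10
print(record["ccdb_context"])  # calibtime=2014-11-10
```

All values are kept as strings; converting revision to an integer is left to the caller.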
+

Revision as of 19:27, 28 October 2021

Master List of File / Database / Webpage Locations

Run Conditions

  • Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
  • Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/
  • Run Info vers. 1
  • Run Info vers. 2
  • RCDB

Monitoring Output Files

  • In the paths below, 201Y-MM stands for the run period (for example, 2015-03) and verVV for the launch version (for example, ver15).
  • Online monitoring histograms: /work/halld/online_monitoring/root/
  • Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
  • Individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/
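
As an illustration of how the 201Y-MM and verVV placeholders expand, the sketch below builds the two offline paths listed above for a given run period and launch version. The function name monitoring_paths is hypothetical, and the two-digit zero padding of the version number is an assumption based on names such as ver15 and ver00.

```python
from pathlib import PurePosixPath


def monitoring_paths(run_period, version):
    """Expand the 201Y-MM / verVV placeholders into concrete paths."""
    period = f"RunPeriod-{run_period}"
    ver = f"ver{version:02d}"  # assumption: two-digit, zero-padded version
    return {
        # Merged offline monitoring histogram ROOT files
        "merged_root": PurePosixPath("/work/halld/data_monitoring") / period / ver / "rootfiles",
        # Individual per-job files (ROOT, REST, log, etc.)
        "job_files": PurePosixPath("/volatile/halld/offline_monitoring") / period / ver,
    }


paths = monitoring_paths("2015-03", 15)
print(paths["merged_root"])  # /work/halld/data_monitoring/RunPeriod-2015-03/ver15/rootfiles
print(paths["job_files"])    # /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15
```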

Monitoring Database

  • Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring

Monitoring Webpages

SciComp Job Links

Main

Documentation

Job Tracking

Procedures: Overview

Online Monitoring: During Experimental Running

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring. A cronjob running in the counting house performs this copy.

This ROOT file is processed similarly to the offline monitoring results, and the results are made available on the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see this page.

Offline Monitoring and Reconstruction: During Experimental Running

During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Incoming: Monitor the first 5 files of each newly-recorded run as soon as it hits the tape.
  2. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  3. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run because data is recorded to tape faster than the monitoring can keep up with. Also, during the experimental run, each run will be fully reconstructed only once, because it will be difficult enough to keep up with the incoming data.
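
The "first 5 files of each run" selection can be sketched as below. This is only an illustration, not the launch scripts themselves: the function name first_files_per_run is hypothetical, and the file-name pattern hd_rawdata_RRRRRR_FFF.evio (run number, then file number) is an assumption made for the example.

```python
import re
from collections import defaultdict

# Assumed raw-data file naming: hd_rawdata_<run>_<file>.evio
FILE_PATTERN = re.compile(r"hd_rawdata_(\d+)_(\d+)\.evio$")


def first_files_per_run(filenames, max_files=5):
    """Group files by run number and keep the lowest-numbered files of each run."""
    by_run = defaultdict(list)
    for name in filenames:
        match = FILE_PATTERN.search(name)
        if match is None:
            continue  # ignore anything that is not a raw-data file
        run, file_number = int(match.group(1)), int(match.group(2))
        by_run[run].append((file_number, name))
    # Sort each run's files by file number and truncate to max_files.
    return {
        run: [name for _, name in sorted(entries)[:max_files]]
        for run, entries in sorted(by_run.items())
    }


files = [f"hd_rawdata_030000_{i:03d}.evio" for i in range(8)]
files.append("hd_rawdata_030001_000.evio")
selected = first_files_per_run(files)
print(len(selected[30000]))  # 5
print(selected[30001])       # ['hd_rawdata_030001_000.evio']
```

Runs with fewer than 5 files simply contribute all of their files.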

Offline Monitoring and Reconstruction: After Experimental Running

After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  2. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
  3. Further Reconstruction Launches: Every ~3 months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run, since there will be a significant amount of data.

Saving to Tape (Write-through Cache): Monitoring Launches

All job output will be directly written to the write-through cache. However, only the following will be saved to tape:

  • REST files: All files.
  • ROOT files: One merged file per run.
    • After merge, the individual files are deleted (so they won't be saved).
  • Job stdout/stderr: One tarball per run
    • After launch analysis, the log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch
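
The "one tarball per run, then delete the individual log files" policy above can be sketched with the Python standard library as follows. The function name archive_run_logs, the log-file naming (*_RRRRRR_*.log) and the tarball name (log_RRRRRR.tar) are all hypothetical; the actual launch scripts may organize these files differently.

```python
import tarfile
from pathlib import Path


def archive_run_logs(log_dir, run, out_dir):
    """Pack all log files of one run into a single tarball, then delete them.

    Returns the tarball path, or None if the run has no log files.
    """
    log_dir, out_dir = Path(log_dir), Path(out_dir)
    # Assumed naming: the six-digit run number appears in each log file name.
    logs = sorted(log_dir.glob(f"*_{run:06d}_*.log"))
    if not logs:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    tarball = out_dir / f"log_{run:06d}.tar"
    with tarfile.open(tarball, "w") as archive:
        for log in logs:
            archive.add(log, arcname=log.name)
    # Remove the individual files only after the tarball is written and closed.
    for log in logs:
        log.unlink()
    return tarball
```

Deleting the individual files only after the archive has been closed means that a failure while writing the tarball leaves the original logs in place.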

Saving to Tape (Write-through Cache): Full Reconstruction Launches

  • REST files: All files.
  • ROOT files: All files, AND one merged file per run.
  • Job stdout/stderr: One tarball per run
    • After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch


Procedures: Details

On- and Offline Monitoring Data Validation

Software Tests