Difference between revisions of "Data Monitoring Procedures"

From GlueXWiki
 
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]
 
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]
 
 +
*[https://halldweb.jlab.org/rcdb RCDB]
  
 
=== Monitoring Output Files ===
 
  
 
=== Monitoring Webpages ===
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]
+
*[https://halldweb.jlab.org/wiki/index.php/Monitoring_webpage_help Help]
 +
*[https://halldweb.jlab.org/data_monitoring/Plot_Browser.html Plot Browser]
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]
 
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]
 
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]
 
 
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]
 
 +
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/recontestBrowser.py Recon Tests]
  
== Job Monitoring Links ==
+
== SciComp Job Links ==
 +
=== Main ===
 +
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]
 +
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]
 +
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs JasMine Tape Job Status Page]
 +
 
 +
=== Documentation ===
 +
* [https://scicomp.jlab.org/docs/batch Batch System]
 +
* [https://scicomp.jlab.org/docs/storage Mass Storage System]
 +
* [https://scicomp.jlab.org/docs/write-through-cache Write-Through Cache]
 +
* [https://scicomp.jlab.org/docs/swif SWIF]
 +
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]
 +
 
 +
=== Job Tracking ===
 
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]
 
 
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]
 
 
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]
 
  
== Saving Online Monitoring Data ==
+
== Procedures: Overview ==
 
+
The procedure for writing the data out is given in, e.g.,
+
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].
+
 
+
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,
+
and within ~20 min., we will have access to the file on tape at
+
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.
+
 
+
All online monitoring plugins will be run as data is taken.
+
They will be accessible within the counting house via RootSpy, and
+
for each run and file, a ROOT file containing the histograms will be saved
+
within a subdirectory for each run.
+
 
+
For immediate access to these files, the raid disk files may be accessed directly
+
from the counting house, or the tape files will be available within ~20 min. of the
+
file being written out.
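
As a rough illustration of the tape path layout described above, the following Python sketch builds the directory for one run. The helper name <code>raw_data_dir</code> is hypothetical, and both the six-digit zero padding (read off the RunXXXXXX pattern) and the example run period string are assumptions.

```python
import posixpath

def raw_data_dir(run_period, run, mss_root="/mss/halld"):
    """Return the tape directory for one run (hypothetical helper)."""
    # Assumption: RunXXXXXX means the run number zero-padded to six digits.
    return posixpath.join(mss_root, run_period, "rawdata", "Run%06d" % run)

print(raw_data_dir("RunPeriod-2015-03", 3180))
```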
+
 
+
== Offline Monitoring: Running Over Archived Data ==
+
 
+
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software.
+
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.
+
 
+
Every other Friday (usually the Friday before the offline meetings), jobs are started to run the newest software on all previous runs, which allows everybody to see improvements in each detector.
+
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.
+
 
+
Below the procedures are described for
+
# Preparing the software for the launch
+
# Starting the launch (using hdswif)
+
# Post-analysis of statistics of the launch
+
 
+
Processing the results and making them available to the collaboration
+
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.
+
 
+
<!--
+
==== Using cron to run automatically ====
+
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins
+
that can be executed via
+
crontab cron_plugins
+
This will set up a cron job to call the script scan_for_jobs.sh, which will
+
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for
+
any run that is more than 5 min old. The cron job is set up to run every 10 min.
+
-->
+
 
+
=== General Information on Procedures ===
+
Since we may want to simultaneously run offline monitoring for different run periods that require
+
different environment variables, the scripts are set up so that a generic user can download the
+
scripts and run them from anywhere. Most output directories for offline monitoring are created
+
with group read/write permissions so that any Hall D group user has access to the contents,
+
but there are some cases where use of the account that created the launch is necessary.
+
 
+
The accounts used for offline monitoring are the gxprojN accounts created and maintained by
+
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).
+
As of October 2015, the following are used:
+
* gxproj1 for running over incoming experimental data (as it hits the tape)
+
* gxproj5 for running over previous experimental data (biweekly launches)
+
 
+
For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis
+
system based on MySQL and Python is maintained.
+
 
+
The scripts for the monitoring are maintained in svn:
+
https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/
+
 
+
=== Preparing the software for the launch ===
+
 
+
1. Set up the environment: <pre>source ~/env_monitoring_launch</pre>
+
 
+
2. Updating & building hdds:
+
<pre>
+
cd $HDDS_HOME
+
git pull                # Get latest software
+
scons -c install        # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers
+
scons install -j4      # Rebuild and re-install with 4 threads
+
</pre>
+
 
+
3. Updating & building sim-recon:
+
<pre>
+
cd $HALLD_HOME/src
+
git pull                # Get latest software
+
scons -c install        # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers
+
scons install -j4      # Rebuild and re-install with 4 threads
+
</pre>
+
 
+
4. Create a new sqlite file containing the very latest calibration constants. The original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here].
+
<pre>
+
cd $GLUEX_MYTOP/../sqlite/
+
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite
+
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file
+
</pre>
+
 
+
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.
+
 
+
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.
+
 
+
=== Starting the Launch and Submitting Jobs ===
+
 
+
1. Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/monitoring/hdswif.
+
<pre>
+
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif
+
cd hdswif
+
</pre>
+
 
+
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look like this:
+
PROJECT                      gluex
+
TRACK                        reconstruction
+
OS                            centos65
+
NCORES                        6
+
DISK                          40
+
RAM                          8
+
TIMELIMIT                    8
+
JOBNAMEBASE                  offmon_
+
RUNPERIOD                    2015-03
+
VERSION                      15
+
OUTPUT_TOPDIR                /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables can be referenced inside a value
+
SCRIPTFILE                    /home/gxproj5/monitoring/hdswif/script.sh                            # Must specify full path
+
ENVFILE                      /home/gxproj5/env_monitoring_launch                                  # Must specify full path
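
To make the config format concrete, here is a minimal Python sketch of how such a file could be read into a dictionary, including the <code>[NAME]</code> substitution and trailing <code>#</code> comments shown in the example. This is not the actual read_config.py; the function name <code>parse_config</code> and its behavior are assumptions for illustration only.

```python
import re

def parse_config(text):
    """Parse 'KEY value  # comment' lines into a dict of strings (sketch)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        parts = line.split(None, 1)
        if len(parts) < 2:                     # skip blank or value-less lines
            continue
        params[parts[0]] = parts[1].strip()

    def expand(value):
        # Replace [NAME] with the value of NAME when NAME is a known key.
        return re.sub(r"\[(\w+)\]",
                      lambda m: params.get(m.group(1), m.group(0)), value)

    return {key: expand(value) for key, value in params.items()}

example = """
RUNPERIOD      2015-03
VERSION        15
OUTPUT_TOPDIR  /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # comment
"""
print(parse_config(example)["OUTPUT_TOPDIR"])
```

All values are kept as strings in this sketch; a real reader may convert numeric parameters such as NCORES.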
+
 
+
3. Creating the workflow: Within SWIF, jobs are registered into workflows.  For offline monitoring, the workflow names are of the form <b>offmon_20YY_MM_verVV</b>, with suitable replacements for the run period year YY, month MM, and version number VV (with leading zeroes). The command "swif list" lists all existing workflows. hdswif also provides a wrapper for most simple SWIF commands.
+
swif list
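
The naming convention can be sketched in Python as follows. The helper <code>workflow_name</code> is hypothetical, and it assumes the run period is given in the <code>YYYY-MM</code> form used for RUNPERIOD in the config file and that the version is padded to two digits.

```python
def workflow_name(run_period, version):
    """Build an offmon_20YY_MM_verVV workflow name (hypothetical helper)."""
    year, month = run_period.split("-")        # e.g. "2015-03" -> "2015", "03"
    return "offmon_%s_%s_ver%02d" % (year, month, int(version))

print(workflow_name("2015-03", 15))
```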
+
 
+
For creation of workflows for offline monitoring the command:
+
hdswif.py create [workflow] -c input.config
+
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example:
+
/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/jana_rawdata_comm_2015_03_ver15.conf
+
/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/soft_comm_2015_03_ver15.xml
+
 
+
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes a version of the software easier to find than a SHA-1 hash does. hdswif will ask whether you would like to create a tag, and if so it executes the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> Tagging is only invoked when the user name is gxprojN. The configuration files are written to /group/halld/data_monitoring/run_conditions/ for gxprojN accounts and to the current directory for other users.
* To use a git-tagged software version, do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre>
+
 
+
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary.
+
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span>
+
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre>
+
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with
+
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre>
+
to register only files 000 - 004 of run 3180 (Unix-style brackets and wildcards can be used).
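
To see which file numbers a pattern such as <code>'00[0-4]'</code> selects, the standard-library <code>fnmatch</code> module can be used as a stand-in. Whether hdswif uses <code>fnmatch</code> internally is not stated here, so treat this only as a way to preview Unix-style matching.

```python
import fnmatch

# Candidate three-digit file numbers 000 .. 007 (illustrative range).
file_numbers = ["%03d" % n for n in range(8)]

# Unix-style bracket pattern, as passed to the -f option above.
selected = fnmatch.filter(file_numbers, "00[0-4]")
print(selected)
```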
+
 
+
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:
+
hdswif.py run [workflow]
+
 
+
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:
+
* Check stderr files. Are they small (<kB)?
+
* Check stdout files. Are they very large (>MB)?
+
* Check output ROOT files. Are they larger than several MB?
+
* Check output REST files. Are they larger than several tens of MB?
+
</b>
+
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre>
+
in which case only 10 jobs will be submitted.
+
 
+
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre>
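
The size checks listed above can be automated along the following lines. This is only a sketch: the thresholds (1 kB, 1 MB, 3 MB for "several MB", 30 MB for "several tens of MB") and the reading that small stderr, modest stdout, and large ROOT/REST files indicate a healthy job are assumptions, and the names <code>CHECKS</code> and <code>looks_healthy</code> are hypothetical.

```python
KB, MB = 1024, 1024 ** 2

# Hypothetical thresholds; tune them to the launch being tested.
CHECKS = {
    "stderr": lambda size: size < KB,        # expected to stay small
    "stdout": lambda size: size <= MB,       # very large logs are suspicious
    "root":   lambda size: size > 3 * MB,    # "several MB" taken as 3 MB
    "rest":   lambda size: size > 30 * MB,   # "several tens of MB" taken as 30 MB
}

def looks_healthy(kind, size_bytes):
    """Return True if a file of this kind has a plausible size."""
    return CHECKS[kind](size_bytes)

print(looks_healthy("stderr", 200), looks_healthy("root", 100 * KB))
```

In practice the sizes would come from <code>os.path.getsize</code> on the files produced by the test jobs.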
+
 
+
=== Checking the Status and Resubmitting ===
+
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away.  Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].
+
 
+
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre>
+
This only re-stages the jobs. Be sure to resubmit them with:
+
<pre>swif run -workflow [workflow] -errorlimit none</pre>
+
 
+
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or 2 additional GB of RAM, respectively. If one more number is added as an option, then that many hours or GB of RAM will be added, e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b>
+
 
+
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf
+
 
+
4. Below is a table describing the various errors that can occur.
+
{| border="1" cellpadding="0" valign="left" style="text-align: left;"
+
!width="150"| ERROR NAME
+
!width="400"| Description
+
!width="400"| Resolution
+
!width="400"| hdswif command
+
|-
+
| AUGER-SUBMIT
+
||
+
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)
+
||
+
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.
+
||
+
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span>
+
|-
+
| AUGER-FAILED
+
||
+
Auger reports the job FAILED with no specific details.
+
||
+
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
+
||
+
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span>
+
|-
+
| AUGER-OUTPUT-FAIL
+
||
+
Failure to copy one or more output files. Can be due to permission problem, quota problem, system error, etc.
+
||
+
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
+
||
+
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span>
+
|-
+
| AUGER-INPUT-FAIL
+
||
+
Auger failed to copy one or more of the requested input files, similar to output failures.  Can also happen if tape file is unavailable (e.g. missing/damaged tape)
+
||
+
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
+
||
+
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span>
+
|-
+
| AUGER-TIMEOUT
+
||
+
Job timed out.
+
||
+
If more time is needed for job add more resources.
+
Default is to add 2 hrs of processing time. Also check whether code is hanging.
+
||
+
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br>
+
Default is to add 2 hours. Optionally specify number of hours at end.
+
|-
+
| AUGER-OVER_RLIMIT
+
||
+
Not enough resources, RAM or disk space.
+
||
+
Add more resources for job.
+
||
+
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br>
+
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.
+
|-
+
| SWIF-MISSING-OUTPUT
+
||
+
Output file specified by user was not found.
+
||
+
Check if output file exists at end of job.
+
||
+
 
+
|-
+
| SWIF-USER-NON-ZERO
+
||
+
User script exited with non-zero status code.
+
||
+
Your script exited with non-zero status. Check the code you are running.
+
||
+
 
+
|-
+
| SWIF-SYSTEM-ERROR
+
||
+
Job failed owing to a problem with swif (e.g. network connection timeout)
+
||
+
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
+
||
+
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span>
+
|}
+
<br style="clear:both;"/>
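
The table above can be condensed into a small lookup, sketched here in Python. The mapping is taken from the table; the dictionary name <code>RESUBMIT_CATEGORY</code> and the helper <code>resubmit_command</code> are hypothetical and simply assemble the hdswif command line as text.

```python
# Error name -> hdswif resubmit category, as listed in the table above.
# None means the table gives no hdswif command (fix the job itself).
RESUBMIT_CATEGORY = {
    "AUGER-SUBMIT": "SYSTEM",
    "AUGER-FAILED": "SYSTEM",
    "AUGER-OUTPUT-FAIL": "SYSTEM",
    "AUGER-INPUT-FAIL": "SYSTEM",
    "AUGER-TIMEOUT": "TIMEOUT",
    "AUGER-OVER_RLIMIT": "RLIMIT",
    "SWIF-MISSING-OUTPUT": None,
    "SWIF-USER-NON-ZERO": None,
    "SWIF-SYSTEM-ERROR": "SYSTEM",
}

def resubmit_command(workflow, error_name, amount=None):
    """Return the hdswif resubmit command for an error, or None."""
    category = RESUBMIT_CATEGORY.get(error_name)
    if category is None:
        return None
    parts = ["hdswif.py", "resubmit", workflow, category]
    # The optional number (hours or GB) only applies to TIMEOUT and RLIMIT.
    if amount is not None and category in ("TIMEOUT", "RLIMIT"):
        parts.append(str(amount))
    return " ".join(parts)

print(resubmit_command("offmon_2015_03_ver15", "AUGER-TIMEOUT", 5))
```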
+
 
+
=== Post-analysis of statistics of the launch ===
+
 
+
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.
+
The next step is to check the resource usage for the current launch and publish the results online.
+
 
+
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML format and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.
+
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online do for example <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.
+
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are <pre>/group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period]/[run period].html </pre> Edit the file to:
+
## Add a new line to the first table which contains the version number, date, and comments for the current launch
+
## Create a link to the webpage for the current launch. Simply copy and paste, modify the previous launch's link to have the correct launch ver.
+
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre>
+
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will back up the XML output just in case. Do <pre>cp ~/halld/hdswif/xml/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre>
+
 
+
=== Cross Analysis of Launches ===
+
 
+
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.
+
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.
+
 
+
# The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis and can be checked out with <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring, the directory should be ~/halld/monitoring/cross_analysis
+
 
+
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>, where e.g. [RUNPERIOD] = 2015_03 and [VERSION] = 22. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.
+
 
+
# Enter the python commands that are in run_cross_analysis.sh . Below are the steps and explanations:
+
## Create a table for the current launch using <pre>./create_cross_analysis_table.sh [RUNPERIOD] [VERSION]</pre>. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems
+
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs
+
## Run <pre>python create_stats_table_row.py [RUNPERIOD] [VERSION]</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.
+
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.
+
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use for launches between CMPMINVERSION and VERSION. By default CMPMINVERSION is 5 launches earlier.
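
The kind of cross-table query that the steps above rely on can be sketched as follows. This example uses an in-memory SQLite database as a stand-in for the MySQL server, only a subset of the listed columns, and made-up placeholder rows; the table naming follows cross_analysis_table_[RUNPERIOD]_ver[VERSION], while the helper <code>table_name</code> is hypothetical.

```python
import sqlite3

def table_name(run_period, version):
    """Table name following cross_analysis_table_[RUNPERIOD]_ver[VERSION]."""
    return "cross_analysis_table_%s_ver%s" % (run_period, version)

conn = sqlite3.connect(":memory:")   # stand-in for the MySQL database

# Placeholder rows only; mem_kb values are invented for illustration.
for version, mem_kb in ((14, 5000000), (15, 4200000)):
    name = table_name("2015_03", version)
    conn.execute("CREATE TABLE %s (run INT, file INT, mem_kb INT, final_state TEXT)" % name)
    conn.execute("INSERT INTO %s VALUES (3180, 0, ?, 'SUCCESS')" % name, (mem_kb,))

# Compare memory use for the same (run, file) across two launches.
query = """
SELECT cur.run, cur.file, prev.mem_kb, cur.mem_kb
FROM %s AS cur
JOIN %s AS prev ON prev.run = cur.run AND prev.file = cur.file
""" % (table_name("2015_03", 15), table_name("2015_03", 14))
rows = conn.execute(query).fetchall()
print(rows)
```

Joining on run and file is what makes the comparison per input file rather than per job.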
+
 
+
== Explanation of Scripts ==
+
Below are brief explanations of each script used in the offline monitoring system and of how they work.
+
 
+
=== hdswif scripts ===
+
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. For the utility scripts, they can be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, adding it to PYTHONPATH will be done by the script, but if ROOTSYS is not set, then the scripts will abort.
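
The ROOTSYS handling described above could look roughly like this. It is a sketch, not the code in hdswif: the function name <code>add_root_to_path</code> is hypothetical, and the assumption that the PyROOT module lives in <code>$ROOTSYS/lib</code> should be checked against the local ROOT installation.

```python
import os
import sys

def add_root_to_path(environ=os.environ, path=sys.path):
    """Put the PyROOT directory on the module search path, or abort."""
    rootsys = environ.get("ROOTSYS")
    if not rootsys:
        # Mirrors the documented behavior: abort when ROOTSYS is not set.
        raise SystemExit("ROOTSYS is not set; cannot set up PyROOT")
    libdir = os.path.join(rootsys, "lib")   # assumption: PyROOT is in $ROOTSYS/lib
    if libdir not in path:
        path.insert(0, libdir)
    return libdir
```

Calling <code>add_root_to_path()</code> before <code>import ROOT</code> would then make the import independent of the caller's PYTHONPATH.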
+
 
+
{| border="1" cellpadding="0" valign="left" style="text-align: left;"
+
!width="150"| file name
+
!width="800"| Description
+
|-
+
| hdswif.py ||
+
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)
+
|-
+
| parse_swif.py ||
+
Called within hdswif.py to create html output from SWIF results.
+
|-
+
| createXMLfiles.py ||
+
* Called by the "create" option of hdswif.py
+
* Creates XML files for logging information about launch. Must specify config file with option -c.
+
* Also adds tags to git repositories of sim-recon and hdds.
+
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.
+
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.
+
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.
+
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked to confirm that all version numbers have been extracted.</b>
+
|-
+
| read_config.py ||
+
* Called by the "add" option of hdswif.py
+
* Takes in config file name, optionally set verbose
+
* Return a dictionary between config parameter names and values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)
+
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed
+
|-
+
| output_job_details.py ||
+
* Called by the "details" option of hdswif.py
+
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards
+
* Finds ids for jobs specified by the run and file number and returns info on each one
+
* Job info is retrieved from pbs farm system and shows configuration parameters for that job
+
|-
+
| results_by_resources.py ||
+
* Called within parse_swif.py
+
* Takes in XML output from SWIF and creates 2 plots
+
* Creates html table showing results of jobs by resources requested for that job
+
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]
+
|-
+
| create_ordered_hists.py ||
+
* Called within parse_swif.py
+
* Takes in XML output from SWIF and creates 2 plots
+
* Plots are dependency time and pending time of each job, ordered by submission.
+
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.
+
|-
+
| create_stacked_times.py ||
+
* Called within parse_swif.py
+
* Takes in XML output from SWIF and creates 2 plots
+
* Plots show total job time divided into colors for different stages
+
* One shows all jobs in order of Auger ID (roughly submission order), the other one shows in order of total job time
+
* The jobs show at a glance which stage contributes how much of the job's time
+
|}
+
 
+
=== Utility scripts ===
+
{| border="1" cellpadding="0" valign="left" style="text-align: left;"
+
!width="150"| file name
+
!width="800"| Description
+
|-
+
| stderr_by_size.py ||
+
* Independent of all other scripts in hdswif directory
+
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files
+
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.
+
|}
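
The idea of sorting log files into directories by stderr size can be sketched as below. The bin edges and directory labels are assumptions (the actual bins used by stderr_by_size.py are not given here), and the names <code>EDGES</code>, <code>LABELS</code>, <code>size_bin</code>, and <code>link_path</code> are hypothetical.

```python
import bisect
import posixpath

EDGES = [1, 1024, 1024 ** 2]   # hypothetical bin edges in bytes
LABELS = ["empty", "under_1kB", "under_1MB", "1MB_or_more"]

def size_bin(size_bytes):
    """Return the label of the size bin a file falls into."""
    return LABELS[bisect.bisect_right(EDGES, size_bytes)]

def link_path(bysize_dir, filename, size_bytes):
    """Where a soft link for this file would go under the bysize directory."""
    return posixpath.join(bysize_dir, size_bin(size_bytes), filename)

print(link_path("log/bysize", "job.err", 300))
```

A real script would obtain sizes with <code>os.path.getsize</code> and create the links with <code>os.symlink</code>.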
+
 
+
=== cross_analysis scripts ===
+
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script
+
be run by hand to catch any errors.
+
 
+
{| border="1" cellpadding="0" valign="left" style="text-align: left;"
+
!width="150"| file name
+
!width="800"| Description
+
|-
+
| run_cross_analysis.sh ||
+
* Main script to call other Python scripts.
+
* Takes in run period and version as arguments and runs cross analysis
+
|-
+
| create_cross_analysis_table.sh ||
+
* Creates MySQL table for current launch that is used by other scripts
+
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]
+
|-
+
| create_stats_table_row.py ||
+
* Create a row in the html table showing the overall statistics of final states and problems for a launch
+
* The final states are "Success", and "Segfault". "Success" includes all jobs that had problems but still finished with Success.
+
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.
+
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results
+
|-
+
| create_stats_for_each_file.py ||
+
* Create html tables showing the status of each file against different launch versions.
+
* Takes in run period, min version, max version and shows final result and problems for all versions in between
+
* Different final results and problems are shown by combinations of text content, color coding of text, and background color
+
|-
+
+
| create_resource_correlation_plots.py ||
+
* Create plots showing correlation of resources between different launches for each file.
+
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest
+
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin
+
|}
+
 
+
== Running Over Data As It Comes In ==
+
 
+
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.
+
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs
+
run the previous Friday. The procedure for this is shown below.
+
 
+
<!--
+
=== Setting up the environment ===
+
The file
+
/home/gxproj1/setup_jlab.csh
+
is sourced through .tcshrc.
+
This file is the same as what is linked to by
+
/home/gluex/setup_jlab_commissioning.csh,
+
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this
+
user can have a separate build.
+
 
+
To obtain the builds from the previous Friday's runs,
+
execute
+
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]
+
The build revisions from the previous Friday are archived in files
+
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml
+
and the script will build libraries based on those stored revision numbers.
+
-->
+
 
+
=== Running the cron job ===
+
 
+
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job.
+
 
+
* Go to the cron job directory:
+
<pre>
+
cd /u/home/gxproj1/halld/monitoring/newruns
+
</pre>
+
 
+
* The cron_plugins file is the cronjob that will be executed.  During execution, it runs the exec.sh command in the same folder.  This command takes two arguments: the project name, and the maximum file number for each run.  These fields should be updated in the cron_plugins file before running.
+
 
+
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number.  It then submits jobs for these files. 
+
 
+
* To start the cron job, run:
+
<pre>
+
crontab cron_plugins
+
</pre>
+
 
+
* To check whether the cron job is running, do
+
<pre>
+
crontab -l
+
</pre>
+
 
+
* To remove the cron job do
+
<pre>
+
crontab -r
+
</pre>
+
 
+
<!--
+
The cron job will run the script scan_for_jobs.sh,
+
which runs generatejobs_plugins_rawdata.sh for any
+
new runs that it had not seen before. All previous
+
runs are recorded in the file filelists/files_current.txt
+
so clear this to run over runs, or set the parameters
+
MINRUN and MAXRUN which will set the range of runs submitted.
+
-->
+
 
+
==Post-Processing Procedures==
+
 
+
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page.  This section describes how to generate the monitoring images and database information.
+
 
+
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process .  If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:
+
<syntaxhighlight>
+
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process
+
</syntaxhighlight>
+
Note that these scripts currently have some parameters which must be periodically set by hand.
+
 
+
The default Python version on most JLab machines does not have the modules these scripts need to connect to the MySQL database.  To run these scripts, load the environment with the following command:
+
<syntaxhighlight>
+
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh
+
</syntaxhighlight>
+
 
+
===Online Monitoring===
+
 
+
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:
+
<syntaxhighlight>
+
/home/gluex/halld/monitoring/process/check_new_runs.py
+
 
+
OR
+
 
+
/home/gluex/halld/monitoring/process/check_new_runs.csh
+
</syntaxhighlight>
+
The shell script sets up the environment properly to run the python script.  To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed.  The shell script is appropriate to use in a cron job.  The cronjob is currently run under the "gluex" account.
+
 
+
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house.  This python script checks for new ROOT files and processes them automatically.  It contains several configuration variables that must be set correctly, such as the locations of the input and output directories.
+
  
===Offline Monitoring===
+
=== Online Monitoring: During Experimental Running ===
  
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages.  Currently, this processing is controlled by a cronjob that runs the following script:
+
After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring . A cronjob running in the counting house performs this function.
<syntaxhighlight>
+
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh 
+
</syntaxhighlight>
+
This script checks for new ROOT files, and only runs over those it hasn't processed yet.  Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis.
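The "only run over new files" bookkeeping described above can be sketched in Python as follows. This is a minimal illustration, not the code in check_monitoring_data.csh or process_new_offline_data.py: the function names, and the idea of recording processed files in a plain text list, are assumptions made for this example.

```python
import os


def find_new_root_files(input_dir, processed_list_path):
    """Return ROOT files under input_dir that are not yet recorded as processed.

    processed_list_path is a hypothetical text file with one path per line.
    """
    processed = set()
    if os.path.exists(processed_list_path):
        with open(processed_list_path) as f:
            processed = {line.strip() for line in f if line.strip()}

    new_files = []
    for dirpath, _dirnames, filenames in os.walk(input_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".root") and path not in processed:
                new_files.append(path)
    return sorted(new_files)


def mark_processed(paths, processed_list_path):
    """Append the given paths to the processed list so they are skipped next time."""
    with open(processed_list_path, "a") as f:
        for path in paths:
            f.write(path + "\n")
```

Calling find_new_root_files, processing the result, and then calling mark_processed gives the behavior described: a second pass over the same directory returns nothing until new ROOT files appear.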
+
  
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros.  If you want to change the list of plots made, you must modify one of the following files:
+
This ROOT file is processed similarly to the offline monitoring results, and the results are made available on the same webpages as "ver00" of the relevant run period.
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path
+
* macros_to_monitor - specify the full path to the RootSpy macro .C file
+
  
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:
+
For more details on the online monitoring system, see [https://halldweb.jlab.org/hdops/wiki/index.php/Online_Monitoring_Shift this page].
# Add a new data version, as described below:
+
# Change the following parameters in check_monitoring_data.csh:
+
## JOBDATE should correspond to the output date used by the job submission script
+
## OUTPUTDIR should be the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.
+
## Once you create a new data version as defined below, you should pass the needed information as a command line option.  Currently this is done by the ARGS variable.  For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.
+
  
<syntaxhighlight>
+
=== Offline Monitoring and Reconstruction: During Experimental Running ===
Example configuration parameters:
+
set JOBDATE=2015-01-09
+
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring
+
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08
+
set ARGS=" -v RunPeriod-2014-10,8 "
+
</syntaxhighlight>
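The relationship between the "-v" argument and the output directory in the example configuration can be illustrated with a short Python sketch. The helper names here are hypothetical and are not part of the monitoring scripts; the two-digit "verNN" form is inferred from the example above, where revision 8 is stored under ver08.

```python
def parse_version_arg(spec):
    """Split a version spec such as 'RunPeriod-2014-10,8' into (run_period, revision)."""
    run_period, sep, revision = spec.strip().rpartition(",")
    if not sep or not run_period:
        raise ValueError("expected '<run period>,<revision>', got %r" % spec)
    return run_period, int(revision)


def version_dir_name(revision):
    """Format a revision number as a zero-padded directory name, e.g. 8 -> 'ver08'."""
    return "ver%02d" % revision


run_period, revision = parse_version_arg("RunPeriod-2014-10,8")
print(run_period, version_dir_name(revision))
```

A check like this can help catch a mismatch between ARGS and OUTPUTDIR before the cron job starts writing results under the wrong version.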
+
If you want to process the results manually, the data is processed using the following script:
+
<syntaxhighlight>
+
./process_new_offline_data.py <input directory> <output directory>
+
  
EXAMPLE:
+
During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:  
  
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02
+
# '''Incoming:''' Monitor the first <span style="color:red">5</span> files of each newly-recorded run as soon as it hits the tape.
</syntaxhighlight>
+
# '''Monitoring Launches:''' Every <span style="color:red">two</span> weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.  
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.
+
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
  
Every time a new reconstruction pass is performed, a new version number must be generated.  To do this, prepare a version file as described below.  Then run the register_new_version.py script to store the information in the database.  The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with.  Also, during the experimental run, each run will only be fully-reconstructed once, because it will be difficult enough to keep up with the incoming data.
<syntaxhighlight>
+
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt
+
</syntaxhighlight>
+
  
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/
+
=== Offline Monitoring and Reconstruction: After Experimental Running ===
of the svn repository</b>, and created a project with
+
create_project.sh [project name] hd_rawdata
+
Then go to the directory [project name]/processing/
+
and execute
+
./run_processing.sh
+
which will run register_new_version.py  as well as check_monitoring_data.csh for that project.
+
  
===Step-by-Step Instructions For Processing a New Monitoring Run===
+
After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:
  
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.
+
# '''Monitoring Launches:''' Every two weeks, do a monitoring launch over the first <span style="color:red">5</span> files of all runs currently available on the tape.
 +
# '''Initial Reconstruction Launch:''' As soon as a new group (e.g. <span style="color:red">~100</span> runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
 +
# '''Further Reconstruction Launches:''' Every <span style="color:red">~3</span> months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data. 
 +
#* We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.  
  
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.
+
Note that the monitoring is limited to the first <span style="color:red">5</span> files of each run, since there will be a significant amount of data.
# Run "svn update" to bring any changes in.  Be sure that the lists of histograms and macros to plot are current.
+
# Add a new data version
+
# Edit check_monitoring_data.csh to point to the current revisions/directories
+
#* VERSION
+
#* ARGS
+
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh
+
# Update files in the web directory, so that the results are displayed on the web pages:  /group/halld/www/halldweb/html/data_monitoring/textdata
+
# Copy the REST files to more permanent locations:
+
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV
+
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV  [under testing]
+
  
Check log files in  $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.
+
=== Saving to Tape (Write-through Cache): Monitoring Launches ===
 +
All job output will be directly written to the write-thru cache. However, only the following will be saved to tape:
 +
* REST files: All files.
 +
* ROOT files: One merged file per run.  
 +
** After merge, the individual files are deleted (so they won't be saved).  
 +
* Job stdout/stderr: One tarball per run
 +
** After launch analysis, the log files are deleted (so they won't be saved).  
 +
* Browser png's: One tarball per launch
  
==Data Versions==
+
=== Saving to Tape (Write-through Cache): Full Reconstruction Launches ===
 +
* REST files: All files.
 +
* ROOT files: All files, <span style="color:blue">AND</span> one merged file per run.
 +
* Job stdout/stderr: One tarball per run
 +
** After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
 +
* Browser png's: One tarball per launch
  
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis.  The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.
 
  
We store one record per pass through one run period, with the following structure:
+
== Procedures: Details ==
  
{| class="wikitable"
+
* [[Offline_Monitoring_Incoming_Data | Offline Monitoring: Running Over Incoming Data]]
! Field !! Description
+
* [[Offline_Monitoring_Archived_Data | Offline Monitoring: Running Over Archived Data]]
|-
+
* [[Offline_Monitoring_Post_Processing | Offline Monitoring: Post-Processing]]
| data_type || The level of data we are processing.  For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
+
* [[DEPRECATED_Offline_Monitoring_Archived_Data | DEPRECATED (Except plots): Offline Monitoring: Running Over Archived Data]]
|-  
+
* [[DSelector_SWIF_Jobs | DSelector SWIF Jobs]]
| run_period || The run period of the data
+
* [[Merging_Analysis_Trees | Analysis Launch: Merging Trees]]
|-
+
| revision || An integer specifying which pass through the run period this data corresponds to
+
|-
+
| software_version || The name of the XML file that specifies the different software versions used
+
|-
+
| jana_config  || The name of the text file that specifies which JANA options were passed to the reconstruction program
+
|-
+
| ccdb_context  || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
+
|-
+
| production_time  || The date on which monitoring/reconstruction began
+
|-
+
| dataVersionString  || A convenient string for identifying this version of the data
+
|}
+
  
 +
=== On- and Offline Monitoring Data Validation===
 +
* [[Offline_Monitoring_Data_Validation | Offline Monitoring: Data Validation]]
 +
* [[Offline_Monitoring_Data_Validation_PrimEx | Offline Monitoring: Data Validation of PrimEx data]]
 +
* [[Offline_Monitoring_Data_Validation_CPP | Offline Monitoring: Data Validation of CPP data]]
 +
* [[Online_Monitoring_Data_Validation | Online Monitoring: Data Validation]]
  
An example file used as input to ./register_new_version.py is:
+
== Software Tests ==
<syntaxhighlight>
+
* [[Software_Test_Data_Recon | Software Test: Experimental Data Reconstruction]]
data_type          = recon
+
** [https://halldweb.jlab.org/recon_test/ Test Results]
run_period          = RunPeriod-2014-10
+
revision            = 1
+
software_version    = soft_comm_2014_11_06.xml
+
jana_config        = jana_rawdata_comm_2014_11_06.conf
+
ccdb_context        = calibtime=2014-11-10
+
production_time    = 2014-11-10
+
dataVersionString  = recon_RunPeriod-2014-10_20141110_ver01
+
</syntaxhighlight>
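As a rough illustration of the version file format, the following Python sketch reads such a file into a dictionary and checks that all fields from the table above are present. It is not the parser used by register_new_version.py; the function name and the treatment of "#" lines as comments are assumptions made for this example.

```python
EXPECTED_FIELDS = (
    "data_type", "run_period", "revision", "software_version",
    "jana_config", "ccdb_context", "production_time", "dataVersionString",
)


def parse_version_file(text):
    """Parse 'key = value' lines into a dict and check for missing fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        # Split on the first '=' only, so values such as
        # 'calibtime=2014-11-10' keep their own '=' sign.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a 'key = value' line: %r" % line)
        fields[key.strip()] = value.strip()
    missing = [name for name in EXPECTED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return fields
```

Splitting on the first "=" matters for the ccdb_context field, whose value in the example above itself contains an "=".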
+

Latest revision as of 13:18, 2 November 2023

Master List of File / Database / Webpage Locations

Run Conditions

  • Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/
  • Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/
  • Run Info vers. 1
  • Run Info vers. 2
  • RCDB

Monitoring Output Files

  • In the paths below, 201Y-MM stands for the run period (for example 2015-03) and verVV for the launch version (for example ver15).
  • Online monitoring histograms: /work/halld/online_monitoring/root/
  • Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles
  • individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/

Monitoring Database

  • Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring

Monitoring Webpages

SciComp Job Links

Main

Documentation

Job Tracking

Procedures: Overview

Online Monitoring: During Experimental Running

After every run is finished, a ROOT file containing histograms from the online monitoring system and a file containing some run conditions are copied to directories under /work/halld/online_monitoring . A cronjob running in the counting house performs this function.

This ROOT file is processed similarly to the offline monitoring results, and the results are made available on the same webpages as "ver00" of the relevant run period.

For more details on the online monitoring system, see this page.

Offline Monitoring and Reconstruction: During Experimental Running

During experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Incoming: Monitor the first 5 files of each newly-recorded run as soon as it hits the tape.
  2. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  3. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run, because data is being recorded to tape at a faster rate than the monitoring can keep up with. Also, during the experimental run, each run will only be fully-reconstructed once, because it will be difficult enough to keep up with the incoming data.
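The "first 5 files of each run" selection can be sketched in Python as below. This is only an illustration under stated assumptions: the file name pattern hd_rawdata_RRRRRR_FFF.evio and file numbering starting at 000 are assumed here, and the function name is hypothetical rather than taken from the job submission scripts.

```python
import re
from collections import defaultdict

# Assumed raw-data file naming: hd_rawdata_<run>_<file>.evio
FILE_PATTERN = re.compile(r"hd_rawdata_(\d+)_(\d+)\.evio$")


def select_first_files(filenames, n_files=5):
    """Group file names by run and keep those with file number below n_files."""
    by_run = defaultdict(list)
    for name in filenames:
        match = FILE_PATTERN.search(name)
        if match is None:
            continue  # skip anything that does not look like a raw-data file
        run, file_number = int(match.group(1)), int(match.group(2))
        if file_number < n_files:
            by_run[run].append((file_number, name))
    # Order each run's files by file number and drop the sort key.
    return {run: [name for _num, name in sorted(items)]
            for run, items in by_run.items()}
```

With n_files=5, a run with files 000 through 006 contributes files 000 through 004, and a run with fewer than 5 files contributes all of them.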

Offline Monitoring and Reconstruction: After Experimental Running

After experimental running, the following offline monitoring procedures should be performed, each with a different gxprojN account, so that they don't interfere with each other:

  1. Monitoring Launches: Every two weeks, do a monitoring launch over the first 5 files of all runs currently available on the tape.
  2. Initial Reconstruction Launch: As soon as a new group (e.g. ~100 runs) of data is initially semi-well calibrated, do a preliminary full-reconstruction launch over all files in that group.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.
  3. Further Reconstruction Launches: Every ~3 months, if there have been significant improvements to the reconstruction / calibrations, do a new full-reconstruction launch over all of the data.
    • We can add user analysis plugins to this launch, including those with ROOT TTree output, provided that they work and don't take much memory.

Note that the monitoring is limited to the first 5 files of each run, since there will be a significant amount of data.

Saving to Tape (Write-through Cache): Monitoring Launches

All job output will be directly written to the write-thru cache. However, only the following will be saved to tape:

  • REST files: All files.
  • ROOT files: One merged file per run.
    • After merge, the individual files are deleted (so they won't be saved).
  • Job stdout/stderr: One tarball per run
    • After launch analysis, the log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch

Saving to Tape (Write-through Cache): Full Reconstruction Launches

  • REST files: All files.
  • ROOT files: All files, AND one merged file per run.
  • Job stdout/stderr: One tarball per run
    • After launch analysis, a tarball is created and the individual log files are deleted (so they won't be saved).
  • Browser png's: One tarball per launch
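The "one tarball per run, then delete the individual log files" step can be sketched with Python's standard tarfile module. This is a minimal illustration and not the launch analysis code: the function name and the assumption that all log files for one run sit in a single directory are hypothetical.

```python
import os
import tarfile


def archive_run_logs(log_dir, tarball_path):
    """Pack the files in log_dir into a gzipped tarball, then delete the originals."""
    log_files = sorted(
        name for name in os.listdir(log_dir)
        if os.path.isfile(os.path.join(log_dir, name))
    )
    with tarfile.open(tarball_path, "w:gz") as tar:
        for name in log_files:
            tar.add(os.path.join(log_dir, name), arcname=name)

    # Re-open the tarball and confirm every file is present before deleting.
    with tarfile.open(tarball_path, "r:gz") as tar:
        archived = set(tar.getnames())
    if archived != set(log_files):
        raise RuntimeError("tarball contents do not match the log directory")

    for name in log_files:
        os.remove(os.path.join(log_dir, name))
    return log_files
```

Verifying the tarball before removing anything is a deliberate choice in this sketch, since the deleted log files are not saved anywhere else.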


Procedures: Details

On- and Offline Monitoring Data Validation

Software Tests