__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
*[https://halldweb.jlab.org/rcdb RCDB]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods are written as 201Y-MM (for example, 2015-03); launch versions are written as verVV (for example, ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files (merged): /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* individual files for each job (ROOT, REST, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
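A quick way to verify access from an ifarm node (a minimal check; the -e flag runs a single statement and exits):<br />
<pre><br />
# List the tables in the monitoring database, then disconnect<br />
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "SHOW TABLES;"<br />
</pre><br />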
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/versionBrowser.py Version Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== SciComp Job Links ==<br />
=== Main ===<br />
* [https://scicomp.jlab.org/scicomp/ Scientific Computing Home Page]<br />
* [https://scicomp.jlab.org/scicomp/#/auger/jobs Auger Job Status Page]<br />
* [https://scicomp.jlab.org/scicomp/#/jasmine/jobs Jasmine Tape Job Status Page]<br />
<br />
=== Documentation ===<br />
* [https://scicomp.jlab.org/docs/batch Batch System]<br />
* [https://scicomp.jlab.org/docs/storage Mass Storage System]<br />
* [https://scicomp.jlab.org/docs/swif SWIF]<br />
* [https://scicomp.jlab.org/docs/swif-cli SWIF Command Line]<br />
<br />
=== Job Tracking ===<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
in a subdirectory for that run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, while the tape copies become available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what was seen in the online monitoring, and also to update the results with the latest calibrations and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings), jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below, the procedures are described for:<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
--><br />
<br />
=== General Information on Procedures ===<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over incoming experimental data (as it hits the tape)<br />
* gxproj5 for running over previous experimental data (biweekly launches)<br />
<br />
For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis<br />
system based on MySQL and Python is maintained. <br />
<br />
The scripts for the monitoring are maintained in svn:<br />
https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Set up the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Update & build hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Update & build sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files is available [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
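Before using the new file in a launch, it is worth a quick sanity check that the conversion produced a usable database (a minimal sketch; the .tables dot-command is standard sqlite3 and lists the tables in the schema):<br />
<pre><br />
# Confirm the sqlite file opens and contains the expected CCDB tables<br />
sqlite3 ccdb_monitoring_launch.sqlite ".tables"<br />
</pre><br />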
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because the revisions of the libraries used are tracked by extracting the version-control information in each directory. Also note that the system assumes the topmost build directory (usually called GLUEX_TOP) is $HOME/builds; this assumption is necessary to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
1. Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/monitoring/hdswif. <br />
<pre><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of referencing other variables within a value<br />
SCRIPTFILE /home/gxproj5/monitoring/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/env_monitoring_launch # Must specify full path<br />
</pre><br />
<br />
3. Creating the workflow: Within SWIF, jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offmon_20YY_MM_verVV</b>, with suitable replacements for the run period year YY, month MM, and the version number VV (with leading zeroes). The command "swif list" will list all existing workflows; hdswif also provides a wrapper for most simple SWIF commands. <br />
swif list<br />
<br />
For offline monitoring, create the workflow with the command:<br />
hdswif.py create [workflow] -c input.config<br />
When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example:<br />
/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to identify a software version than a bare SHA-1 hash. hdswif will ask whether you would like to create a tag, and will execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This is only invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
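To see which monitoring tags already exist in a repository before creating a new one (standard git commands; sim-recon shown here, hdds is analogous):<br />
<pre><br />
cd $HALLD_HOME/src<br />
git tag -l 'offmon-*'   # list existing offline-monitoring tags<br />
</pre><br />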
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
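If several specific runs need to be registered, the add command can simply be repeated in a loop (a sketch; runs 3181 and 3185 here are hypothetical placeholders):<br />
<pre><br />
# Register files 000-004 for a few selected runs<br />
for run in 3180 3181 3185 ; do<br />
    hdswif.py add [workflow] -c input.config -r $run -f '00[0-4]'<br />
done<br />
</pre><br />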
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that a few jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (less than a few kB)?<br />
* Check stdout files. Are they very large (more than a few MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
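One quick way to spot-check the test jobs against the size guidelines above (a sketch; the paths follow the OUTPUT_TOPDIR convention in the config example, and the *.err suffix is an assumption about how script.sh names the log files):<br />
<pre><br />
# Flag suspiciously large stderr files and suspiciously small ROOT files<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15 -name '*.err' -size +1k<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15 -name '*.root' -size -1M<br />
</pre><br />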
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
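For periodic checks from a terminal, the standard watch utility can poll either command (a convenience sketch, not part of hdswif; gxproj5 here is just an example account):<br />
<pre><br />
watch -n 300 'jobstat -u gxproj5 | tail -n 20'   # refresh every 5 minutes<br />
</pre><br />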
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
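For AUGER-OVER_RLIMIT failures caused by disk rather than RAM, SWIF must be used directly. By analogy with the RAM example above (the -disk option name is an assumption; confirm it with "swif help" before use):<br />
<pre><br />
swif modify-jobs [workflow] -disk add 10gb -problems AUGER-OVER_RLIMIT   # -disk flag unverified; check swif help<br />
swif run -workflow [workflow] -errorlimit none                           # re-staged jobs still need to be resubmitted<br />
</pre><br />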
<br />
3. For information on swif, use the "swif help" commands; for hdswif, see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. This includes server-side problems as well as the user failing to provide valid job parameters (e.g., incorrect project name, too many resources, etc.)<br />
||<br />
If the requested resources are known to be correct, resubmit. Otherwise modify the job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check that the output files will exist after job execution and that the output directory exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if the tape file is unavailable (e.g., missing/damaged tape)<br />
||<br />
Check that the input file exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
The default is to add 2 hrs of processing time. Also check whether the code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for the job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check that the output file exists at the end of the job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF, called swif_output_[workflow].xml, and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file. (A combined command sequence for these post-analysis steps is sketched after this list.)<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online do, for example, <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html, which has links to the summary page for each run period. The summary files are <pre>/group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period]/[run period].html </pre> Edit the file to:<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to point to the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/xml/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
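Putting the steps above together for a hypothetical 2015-03 ver15 launch (the workflow name follows the offmon_20YY_MM_verVV convention from the launch section):<br />
<pre><br />
hdswif.py summary offmon_2015_03_ver15<br />
python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 15<br />
swif freeze offmon_2015_03_ver15<br />
cp ~/halld/hdswif/xml/swif_output_offmon_2015_03_ver15.xml /group/halld/data_monitoring/swif_xml_backup/<br />
</pre><br />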
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/monitoring/cross_analysis<br />
<br />
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>, where e.g. [RUNPERIOD] = 2015_03 and [VERSION] = 22. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.<br />
<br />
# Enter the python commands that are in run_cross_analysis.sh by hand. Below are the steps and explanations (a combined example follows this list):<br />
## Create a table for the current launch using <pre>./create_cross_analysis_table.sh [RUNPERIOD] [VERSION]</pre>. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems<br />
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs<br />
## Run <pre>python create_stats_table_row.py [RUNPERIOD] [VERSION]</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.<br />
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.<br />
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use between the launches from CMPMINVERSION to VERSION. By default CMPMINVERSION is 5 launches earlier.<br />
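Concretely, for run period 2015_03 and version 22 (with MINVERSION 15 and CMPMINVERSION 17, i.e., five launches earlier, per the defaults described above), the by-hand sequence is:<br />
<pre><br />
./create_cross_analysis_table.sh 2015_03 22<br />
python fill_cross_analysis_entries.py 2015_03 22<br />
python create_stats_table_row.py 2015_03 22<br />
python create_stats_for_each_file.py 2015_03 15 22<br />
python create_resource_correlation_plots.py 2015_03 17 22<br />
</pre><br />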
<br />
=== Starting a new run period ===<br />
When a new run period is started, it is best to make sure all top-level directories are created with the right permissions. This can save headaches later on when a different gxprojN account is used for offline monitoring.<br />
# Create top-level directories<pre> /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/</pre> <pre>/group/halld/data_monitoring/run_conditions/RunPeriod-20YY-MM/</pre><br />
#Make sure other gxprojN users can write to them with chmod g+w [dir name]. Check that permissions match those from previous run periods (see the sketch below).<br />
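A minimal sketch for a hypothetical RunPeriod-2016-02 (directory names follow the conventions above; verify the resulting permissions against a previous run period):<br />
<pre><br />
mkdir /volatile/halld/offline_monitoring/RunPeriod-2016-02/<br />
mkdir /group/halld/data_monitoring/run_conditions/RunPeriod-2016-02/<br />
chmod g+w /volatile/halld/offline_monitoring/RunPeriod-2016-02/ /group/halld/data_monitoring/run_conditions/RunPeriod-2016-02/<br />
ls -ld /volatile/halld/offline_monitoring/RunPeriod-*   # compare permissions with earlier run periods<br />
</pre><br />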
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and a brief explanation of how they work.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, the script will add it to PYTHONPATH; if ROOTSYS is not set, the scripts will abort.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding files to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* The plots show at a glance which stage contributes how much of each job's time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.<br />
|}<br />
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit the same job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process. If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it then processes. It contains several configuration variables that must be set correctly, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (hypothetical example entries are shown after this list):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
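As a purely hypothetical illustration (these entries are made up; use real histogram names and macro paths from the monitoring plugins), both files contain one entry per line:<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path, one per line<br />
fcal_num_events<br />
/HLDetectorTiming/TAGH/TAGH_timing<br />
<br />
# macros_to_monitor: full path to each RootSpy macro .C file<br />
/group/halld/www/halldweb/html/data_monitoring/macros/FCAL_occupancy.C<br />
</pre><br />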
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check the log files and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
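A quick way to scan those logs for trouble (a sketch; the file layout and naming under the log directory are assumed):<br />
<pre><br />
grep -il 'error\|traceback' $HOME/halld/monitoring/process/log/*   # list log files that mention errors<br />
</pre><br />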
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71863Data Monitoring Procedures2015-12-11T21:39:07Z<p>Kmoriya: /* Cross Analysis of Launches */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing.<br />
For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis<br />
system based on MySQL and Python is maintained. The jproj system is now deprecated for offline monitoring.<br />
<br />
<br />
The scripts for hdswif, cross analysis and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* cross analysis: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored, for example, as:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to identify software versions than a raw SHA-1 hash. hdswif will ask if you would like to create a tag, and will execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
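<br />
To confirm that the tag was created and pushed, standard git commands (not part of hdswif) can be used:<br />
<pre><br />
git tag -l 'offmon-*'                   # tags present in the local clone<br />
git ls-remote --tags origin 'offmon-*'  # tags visible on the remote<br />
</pre><br />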
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif uses config files. Jobs can be registered by specifying the workflow, the config file (-c), and run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that a few jobs be tested first to make sure that everything is working, rather than failing thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following (a concrete sketch follows below):<br />
* Check stderr files. Are they small (under ~1 kB)?<br />
* Check stdout files. Are they very large (more than ~1 MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, the run command of hdswif takes an additional parameter which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
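<br />
As a concrete way to run the checks above, the output areas can be inspected by hand. This is a sketch only: the directory layout follows the OUTPUT_TOPDIR convention from the config file (here for 2015-03 ver15) and the log subdirectory mentioned in the stderr_by_size.py description below, while the .err suffix and REST files being written as .hddm are assumptions:<br />
<pre><br />
cd /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15<br />
ls -lhS log/*.err | head           # largest stderr files first; flag anything unexpectedly big<br />
find . -name '*.root' -size -1M    # ROOT files under 1 MB are suspicious<br />
find . -name '*.hddm' -size -10M   # REST files well under tens of MB are suspicious<br />
</pre><br />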
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see the [http://scicomp.jlab.org/scicomp/#/auger/jobs Auger job website], and for SWIF use <pre>swif list</pre> or, for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" can lag behind reality, so don't panic if your workflow/jobs aren't showing up right away.<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added, e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands; for hdswif, see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as the user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct, resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check that output files will exist after job execution and that the output directory exists, then resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if a tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
The default is to add 2 hrs of processing time. Also check whether the code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources (RAM or disk space).<br />
||<br />
Add more resources for the job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online, do for example <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period]/[run period].html Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to have the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will back up the XML output just in case. Do <pre>cp ~/halld/hdswif/xml/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/monitoring/cross_analysis<br />
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>, where e.g. [RUNPERIOD] = 2015_03 and [VERSION] = 22. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.<br />
# Enter the python commands that are in run_cross_analysis.sh . Below are the steps and explanations:<br />
## Create a table for the current launch using <pre>./create_cross_analysis_table.sh [RUNPERIOD] [VERSION]</pre>. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems<br />
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs<br />
## Run <pre>python create_stats_table_row.py</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.<br />
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.<br />
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use between launches CMPMINVERSION through VERSION. By default CMPMINVERSION is 5 launches earlier.<br />
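<br />
Beyond the canned plots, ad-hoc comparisons can be made directly in MySQL. The following is a hedged sketch: the table-name convention and columns are those described in the steps above, but whether the cross-analysis tables live in the same farming database (with the same credentials) as the jproj tables shown later in this document is an assumption:<br />
<pre><br />
# Compare wall time for the same run/file between launches ver21 and ver22<br />
mysql -hhallddb -ufarmer farming -e \<br />
  "SELECT a.run, a.file, a.wall_sec AS ver21_wall, b.wall_sec AS ver22_wall<br />
   FROM cross_analysis_table_2015_03_ver21 AS a<br />
   JOIN cross_analysis_table_2015_03_ver22 AS b USING (run, file)<br />
   LIMIT 10;"<br />
</pre><br />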
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and a brief description of how it works.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, it will be added to PYTHONPATH by the scripts; if ROOTSYS is not set, the scripts will abort.<br />
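<br />
A minimal sketch of the equivalent check, in bash syntax (an illustration only; the actual scripts are Python, and the exact path needed depends on the local ROOT install):<br />
<pre><br />
if [ -z "$ROOTSYS" ]; then<br />
    echo "ROOTSYS is not set; aborting" >&2<br />
    exit 1<br />
fi<br />
# The scripts add ROOTSYS to PYTHONPATH; depending on the ROOT install,<br />
# the PyROOT module may instead live under $ROOTSYS/lib<br />
export PYTHONPATH="$ROOTSYS:$PYTHONPATH"<br />
</pre><br />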
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* The plots show at a glance how much each stage contributes to a job's total time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.<br />
|}<br />
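<br />
A hypothetical invocation of stderr_by_size.py, following the argument order described in the table above (run period, then version):<br />
<pre><br />
python stderr_by_size.py 2015_03 15<br />
# Browse the links, grouped into subdirectories by stderr file size<br />
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize<br />
</pre><br />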
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section gives instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
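<br />
To list all three tables for a given project at once, a sketch using the same farming-database credentials as the mysql commands shown later in this section (the <project_name> placeholder follows the convention above):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "SHOW TABLES LIKE '<project_name>%';"<br />
</pre><br />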
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked out, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
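<br />
For example, a sketch using the project-name convention above:<br />
<pre><br />
# Test with at most 5 jobs, restricted to run 3180<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
# Once satisfied, submit everything that is registered and not yet submitted<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit<br />
</pre><br />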
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send all jobs in. The remaining steps are then the monitoring,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
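<br />
To restore from such a backup, the dump file can simply be fed back to mysql. A hedged sketch: the dump file name here is hypothetical, and remember that executing it drops any existing tables of the same names first:<br />
<pre><br />
mysql -hhallddb -ufarmer farming < tables_2014_10_ver17.sql<br />
</pre><br />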
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
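<br />
For reference, a hypothetical cron_plugins entry (the every-10-minutes schedule, project name, and maximum file number shown here are all placeholders, not values from the production file):<br />
<pre><br />
# Run exec.sh every 10 minutes: [project name] [max file number]<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10<br />
</pre><br />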
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, including the locations of input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path (see the example after this list) <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
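<br />
For example, entries in histograms_to_monitor might look like the following. These are hypothetical histogram names, and the one-entry-per-line format is assumed from the description above:<br />
<pre><br />
NumEvents<br />
/CDC/cdc_occupancy<br />
</pre><br />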
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71854Data Monitoring Procedures2015-12-11T15:38:22Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing.<br />
For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis<br />
system based on MySQL and Python is maintained. The jproj system is now deprecated for offline monitoring.<br />
<br />
<br />
The scripts for hdswif, cross analysis and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* cross analysis: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register running over only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
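Before registering a subset, it can help to confirm which file stubs actually exist on tape for that run; for example (run number purely illustrative):<br />
<pre><br />
ls /mss/halld/RunPeriod-2015-03/rawdata/Run003180/<br />
</pre><br />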
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that a few jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (under ~1 kB)?<br />
* Check stdout files. Are they very large (over ~1 MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
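A minimal sketch for working through the checklist above, assuming the test jobs wrote their logs and output under the usual /volatile launch directory and that stderr/stdout files carry .err/.out suffixes (adjust names to the actual layout):<br />
<pre><br />
# stderr files larger than ~1 kB usually deserve a closer look<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log -name '*.err' -size +1k<br />
# output ROOT files smaller than a few MB are suspicious<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15 -name '*.root' -size -3M<br />
</pre><br />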
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see the [http://scicomp.jlab.org/scicomp/#/auger/jobs Auger job website], and for SWIF use <pre>swif list</pre> or, for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away.<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
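The RLIMIT wrapper only adds RAM. To add disk space, use SWIF directly and then rerun the workflow; a sketch, assuming modify-jobs accepts -disk with the same add syntax as -ram shown above: <pre>swif modify-jobs [workflow] -disk add 10gb -problems AUGER-OVER_RLIMIT</pre> <pre>swif run -workflow [workflow] -errorlimit none</pre><br />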
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
The default is to add 2 hrs of processing time. Also check whether the code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources (RAM or disk space).<br />
||<br />
Add more resources for the job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML format and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capability of hdswif is useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online, do for example <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to:<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy and paste the previous launch's link, and modify it to point to the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will back up the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/monitoring/cross_analysis<br />
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.<br />
# Enter the python commands that are in run_cross_analysis.sh . Below are the steps and explanations:<br />
## Create a table for the current launch using create_cross_analysis_table.sh. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems<br />
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs<br />
## Run <pre>python create_stats_table_row.py</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.<br />
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.<br />
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use across launches from CMPMINVERSION to VERSION. By default, CMPMINVERSION is 5 launches earlier.<br />
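Once the launch tables exist, ad hoc comparisons can also be made by joining them directly. A hypothetical example, assuming the tables are kept in the data_monitoring database (check where create_cross_analysis_table.sh actually creates them) and that a ver16 launch exists, listing files whose CPU time more than doubled between ver15 and ver16:<br />
<pre><br />
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "<br />
  SELECT a.run, a.file, a.cpu_sec AS ver15_cpu, b.cpu_sec AS ver16_cpu<br />
  FROM cross_analysis_table_2015_03_ver15 a<br />
  JOIN cross_analysis_table_2015_03_ver16 b ON a.run = b.run AND a.file = b.file<br />
  WHERE b.cpu_sec > 2 * a.cpu_sec;"<br />
</pre><br />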
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and brief descriptions of how they work.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include the ROOT libraries under ROOTSYS. If ROOTSYS has been set, the scripts will add it to PYTHONPATH themselves, but if ROOTSYS is not set, the scripts will abort.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked to confirm that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Return a dictionary between config parameter names and values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order); the other shows them in order of total job time<br />
* The jobs show at a glance which stage contributes how much of the job's time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>. See the usage sketch below this table.<br />
|}<br />
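A usage sketch for stderr_by_size.py (the argument order follows the run-period/version convention of the other scripts; the exact format expected is an assumption):<br />
<pre><br />
cd ~/halld/hdswif<br />
python stderr_by_size.py 2015-03 15<br />
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize/<br />
</pre><br />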
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job, they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job; the culling is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked out, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining steps are then the post-processing<br />
of the monitoring results, which will (among other things) put the results on the webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
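For example, a cautious first submission might look like this (project name and run number illustrative), using the submit syntax above to send at most 5 jobs for a single run:<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />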
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
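For a quick summary of a launch, the same tables can be aggregated; e.g., to count jobs by result (columns as in the status query above):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select result, count(*) from <project_name>Job group by result"<br />
</pre><br />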
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
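To restore from a backup, feed the dump file back to mysql (standard mysqldump usage; the dump file name below is a placeholder for whatever backup_tables.sh wrote out). Note again that the dump drops existing tables before recreating them:<br />
<pre>mysql -hhallddb -ufarmer farming < backup_tables_2014_10_ver17.sql</pre><br />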
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit some jobs multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
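A quick check that the environment took effect is to import a MySQL client module with the now-active python (the exact module the scripts use is an assumption here; substitute the real one if it differs):<br />
<syntaxhighlight><br />
python -c 'import MySQLdb'<br />
</syntaxhighlight><br />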
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
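For instance, histograms_to_monitor entries might look like the following (hypothetical names, one entry per line, purely illustrative of the two formats):<br />
<syntaxhighlight><br />
NumHighLevelObjects<br />
/occupancy/cdc_occupancy<br />
</syntaxhighlight><br />
while a macros_to_monitor entry would be a full path such as /home/gxproj1/halld/monitoring/macros/occupancy_cdc.C (again hypothetical).<br />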
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71853Data Monitoring Procedures2015-12-11T15:37:34Z<p>Kmoriya: /* General Information on Procedures */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing.<br />
For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis<br />
system based on MySQL and Python is maintained. The jproj system is now deprecated for offline monitoring.<br />
<br />
<br />
The scripts for hdswif, cross analysis and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* cross analysis: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. To resubmit failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be resubmitted with more resources, use, e.g., <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
Note that this only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
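For example, a complete retry-and-run sequence with the workflow name written out (illustrative run period and version):<br />
<pre><br />
# retry jobs that failed with AUGER-FAILED, then restart the workflow<br />
swif retry-jobs offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -problems AUGER-FAILED<br />
swif run -workflow offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -errorlimit none<br />
</pre><br />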
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added, e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online do, for example, <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html, which has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html. Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy the previous launch's link and modify it to have the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/monitoring/cross_analysis<br />
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.<br />
# Enter the python commands that are in run_cross_analysis.sh. Below are the steps and explanations (a consolidated example follows the list):<br />
## Create a table for the current launch using create_cross_analysis_table.sh. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems<br />
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs<br />
## Run <pre>python create_stats_table_row.py</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.<br />
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.<br />
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use between launches [CMPMINVERSION] and [VERSION]. By default CMPMINVERSION is 5 launches earlier.<br />
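Run by hand, the full sequence looks like the following (illustrative run period and version numbers, following the argument conventions above):<br />
<pre><br />
# cross analysis for run period 2015_03, launch ver18, comparing back to ver15/ver13<br />
./create_cross_analysis_table.sh 2015_03 18<br />
python fill_cross_analysis_entries.py 2015_03 18<br />
python create_stats_table_row.py<br />
python create_stats_for_each_file.py 2015_03 15 18<br />
python create_resource_correlation_plots.py 2015_03 13 18<br />
</pre><br />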
<br />
== Explanation of Scripts ==<br />
Below is an explanation of each script used in the offline monitoring system and a brief description of how it works.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts; the utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include the ROOT libraries under ROOTSYS. If ROOTSYS has been set, adding it to PYTHONPATH will be done by the script, but if ROOTSYS is not set, then the scripts will abort.<br />
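For standalone use, a minimal csh sketch (assuming the PyROOT bindings live in $ROOTSYS/lib, the usual layout):<br />
<pre><br />
# prepend the ROOT python bindings to PYTHONPATH<br />
setenv PYTHONPATH ${ROOTSYS}/lib:${PYTHONPATH}<br />
# if PYTHONPATH was previously unset, use instead: setenv PYTHONPATH ${ROOTSYS}/lib<br />
python -c 'import ROOT'<br />
</pre><br />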
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked to confirm that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows the configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other one shows in order of total job time<br />
* The plots show at a glance which stage contributes how much of the job's time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.<br />
|}<br />
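An illustrative invocation, assuming the script is run directly with python and takes the run period and version as described in the table above:<br />
<pre><br />
python stderr_by_size.py 2015_03 15<br />
</pre><br />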
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
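For example, a quick look at the job status table (columns as documented in the Project Management section below):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select run,file,jobId,status,result from <project_name>Job limit 10"<br />
</pre><br />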
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory that it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run over. Without these options, all files that are registered and have not yet been submitted will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash (see the example below). Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining tasks are then the post-processing<br />
of the monitoring output, which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
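For example, to submit five test jobs for a single run (illustrative project name and run number):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />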
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
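To restore, the dump can be fed back to mysql. A sketch with a hypothetical dump file name (check what backup_tables.sh actually produced):<br />
<pre><br />
# WARNING: this drops and recreates the tables contained in the dump<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />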
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit some jobs multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
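For reference, a crontab entry of this kind might look like the following (hypothetical schedule and arguments; check cron_plugins for the actual contents):<br />
<pre><br />
# run exec.sh every 30 minutes with a maximum file number of 20<br />
*/30 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh <project_name> 20<br />
</pre><br />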
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
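Both files are plain lists with one entry per line. Hypothetical examples for illustration (the actual entries depend on the histograms and macros in use), first for histograms_to_monitor:<br />
<pre><br />
hitMultiplicity<br />
/occupancy/cdc_occupancy<br />
</pre><br />
and, for macros_to_monitor, a full macro path such as:<br />
<pre><br />
/home/gxproj1/halld/monitoring/macros/occupancy.C<br />
</pre><br />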
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch: <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file. (A consolidated command sketch for steps 1-5 follows this list.)<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any SWIF user, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online do, for example, <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html, which has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html. Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to have the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
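<br />
A consolidated sketch of the five steps above for a single launch (the run period and version are illustrative):<br />
<pre><br />
swif status offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata -summary -runs<br />
hdswif.py summary offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata<br />
python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18<br />
# (edit the run-period summary HTML page by hand, as in step 3)<br />
swif freeze offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata<br />
cp ~/halld/hdswif/swif_output_offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata.xml /group/halld/data_monitoring/swif_xml_backup/<br />
</pre><br />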
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/monitoring/cross_analysis<br />
# The main script is run_cross_analysis.sh, which can be run with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre>. However, <b>it is strongly recommended that the commands in this script be run by hand</b> to catch any errors.<br />
# Enter the python commands that are in run_cross_analysis.sh. Below are the steps and explanations:<br />
## Create a table for the current launch using create_cross_analysis_table.sh. The table will be created from the file template_table_schema.sql and contain columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems<br />
## Run <pre>python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]</pre> The script will gather all of the necessary information either from SWIF output or the stdout files for the jobs<br />
## Run <pre>python create_stats_table_row.py</pre> This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.<br />
## Run <pre>python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]</pre> This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.<br />
## Run <pre>python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]</pre> This creates correlation plots of resource use between launches from [CMPMINVERSION] to [VERSION]. By default CMPMINVERSION is 5 launches earlier.<br />
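<br />
Because each launch gets its own table, cross-launch comparisons reduce to simple joins. A minimal sketch of such a query (the table names follow the naming scheme above and the columns are from the schema description; the database name and login shown elsewhere on this page are assumed to apply here as well):<br />
<pre><br />
# Compare per-file wall time between ver17 and ver18, worst regressions first<br />
mysql -hhallddb -ufarmer farming -e "SELECT a.run, a.file, a.wall_sec AS wall_ver17, b.wall_sec AS wall_ver18, b.final_state FROM cross_analysis_table_2015_03_ver17 a JOIN cross_analysis_table_2015_03_ver18 b ON a.run = b.run AND a.file = b.file ORDER BY b.wall_sec - a.wall_sec DESC LIMIT 20;"<br />
</pre><br />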
<br />
== Explanation of Scripts ==<br />
Below is an explanation of each script used in the offline monitoring system and a brief description of how it works.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script and calls the other utility scripts; the utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS is set, the scripts will add it to PYTHONPATH automatically; if ROOTSYS is not set, the scripts will abort.<br />
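<br />
A minimal environment check along these lines, in csh syntax to match the gxprojN setup scripts (treating $ROOTSYS/lib as the PyROOT location is an assumption about the local ROOT build):<br />
<pre><br />
if ( ! $?ROOTSYS ) then<br />
    echo "ROOTSYS not set -- the hdswif plotting scripts will abort"<br />
    exit 1<br />
endif<br />
# PyROOT is typically importable once $ROOTSYS/lib is on PYTHONPATH<br />
if ( $?PYTHONPATH ) then<br />
    setenv PYTHONPATH ${ROOTSYS}/lib:${PYTHONPATH}<br />
else<br />
    setenv PYTHONPATH ${ROOTSYS}/lib<br />
endif<br />
python -c "import ROOT" && echo "PyROOT OK"<br />
</pre><br />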
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Return a dictionary between config parameter names and values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from default, a '*' will be printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other one shows in order of total job time<br />
* The plots show at a glance how much each stage contributes to each job's total time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.<br />
|}<br />
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name, which is assumed to be of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata; this string is parsed to give the run period 20YY_MM and version number VV. BEFORE creating a new project, edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub. This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions.<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked out, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash (see the sketch below).<br />
Once you are sure of this, you can send all jobs in. The remaining tasks are then the monitoring<br />
of the jobs, which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
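<br />
A minimal test-then-full submission sketch using the commands above (the project name and run number are illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver16_hd_rawdata update<br />
# Submit 5 test jobs for a single run and inspect the output before committing<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver16_hd_rawdata submit 5 3180<br />
# Once the test jobs look good, submit everything registered but not yet submitted<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver16_hd_rawdata submit<br />
</pre><br />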
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop existing tables before recreating them, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
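<br />
To restore from such a backup, replay the dump file through mysql (the dump filename is hypothetical; note the caution above, since the dump drops existing tables before recreating them):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />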
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the crontab that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running (a minimal sketch of such a file follows this list). <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
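<br />
A minimal sketch of what a cron_plugins file along these lines could contain (the 10-minute cadence, log path, and argument values are all assumptions to be adapted):<br />
<pre><br />
# minute hour day-of-month month day-of-week command<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1<br />
</pre><br />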
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process. If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
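<br />
For example, entries might look like the following (the histogram names, ROOT path, and macro path are hypothetical, and the one-entry-per-line format is an assumption):<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path, one per line<br />
fcal_occupancy<br />
/CDC/cdc_num_events<br />
<br />
# macros_to_monitor: full path to a RootSpy macro .C file<br />
/home/gxproj1/halld/monitoring/process/macros/occupancy_cdc.C<br />
</pre><br />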
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
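<br />
For instance, to force a full reprocessing of the launch shown in the example above (assuming --force can simply be prepended to the usual arguments):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />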
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71850Data Monitoring Procedures2015-12-11T04:19:45Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files.Can be due to permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
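For jobs where hdswif is not used, the table above can be acted on with the raw swif commands shown earlier. Below is a minimal sketch for handling system and timeout failures; it assumes that swif modify-jobs accepts a -time option analogous to the -ram option shown above, so verify with "swif help" first:<br />
<pre><br />
# retry a system-type failure with unchanged resources<br />
swif retry-jobs [workflow] -problems AUGER-FAILED<br />
# re-stage timed-out jobs with more time (assumed -time option), then resubmit everything<br />
swif modify-jobs [workflow] -time add 2h -problems AUGER-TIMEOUT<br />
swif run -workflow [workflow] -errorlimit none<br />
</pre><br />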
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved on the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML format and creates an HTML webpage showing the results of the launch: <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF, called swif_output_[workflow].xml, and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite it.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files have been created and are ready to be put online. The html output capabilities of hdswif are useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/halld/monitoring/cross_analysis. To publish the results online run, for example, <pre>python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18</pre> The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to them in the html page.<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to use the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we back up the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre> A consolidated example of these steps is shown below.<br />
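Putting the scriptable steps together (the summary-page edits in step 3 are done by hand), a post-analysis session for a hypothetical ver15 launch of RunPeriod-2015-03 might look like:<br />
<pre><br />
hdswif.py summary offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
python ~/halld/monitoring/cross_analysis/publish_offmon_results.py 2015_03 15<br />
swif freeze offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
cp ~/halld/hdswif/swif_output_offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata.xml /group/halld/data_monitoring/swif_xml_backup/<br />
</pre><br />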
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and to run queries across tables.<br />
<br />
# The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj . Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj .<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations between different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as in the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it ran on, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts its contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up; see the section on backing up offline monitoring tables below. A consolidated example of the steps in this list follows.<br />
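A minimal sketch of the whole sequence for a hypothetical ver15 launch of RunPeriod-2015-03:<br />
<pre><br />
cd ~/halld/jproj/projects<br />
./create_project.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
cd offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata/analysis<br />
# requires swif_output_[workflow].xml from "hdswif.py summary" to exist<br />
python create_jproj_job_table.py 2015_03 15<br />
mysql -hhallddb -ufarmer farming -e "describe offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob;"<br />
</pre><br />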
<br />
== Explanation of Scripts ==<br />
Below is an explanation of each script used in the offline monitoring system and a brief description of how it works.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, the scripts will add it to PYTHONPATH; if ROOTSYS is not set, the scripts will abort.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. Parameters changed from their defaults are marked with a '*'<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in the workflow name, run, and file. Run and file must be numbers; no wildcards<br />
* Finds the ids of the jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows the configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. Jobs submitted at the same time are ordered by increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* The plots show at a glance how much each stage contributes to each job's time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Frequently, different problems are easier to find based on the size of stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b> (see the example below).<br />
|}<br />
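For example, to bin the log files of a hypothetical ver15 launch of RunPeriod-2015-03 by stderr size (the exact format of the run-period argument is an assumption; check the script's usage message):<br />
<pre><br />
python stderr_by_size.py 2015-03 15<br />
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize<br />
</pre><br />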
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of the text, and background color<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
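As an example, once create_cross_analysis_table.sh has run for a hypothetical ver15 launch of RunPeriod-2015-03, the resulting table can be inspected directly:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from cross_analysis_table_2015_03_ver15 limit 5;"<br />
</pre><br />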
<br />
== Hall D Job Management System ==<br />
<br />
This section gives instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
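A quick way to get a feel for the three tables is to peek at a few rows of each; this sketch assumes nothing about the columns beyond those listed above:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name> limit 3;"<br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>Job limit 3;"<br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>_aux limit 3;"<br />
</pre><br />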
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory that it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can submit all of the jobs. The remaining steps are then the post-processing of the monitoring output,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
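The full sequence above, condensed into a single hypothetical session (remember to edit templates/template.jsub before the create step; the final command submits 5 test jobs, as recommended):<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
cd jproj/projects<br />
./create_project.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
cd offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
./clear.sh<br />
source ../../scripts/setup.csh<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata update<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5<br />
</pre><br />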
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
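For instance, to single out jobs that did not finish cleanly (the exact strings stored in the result column are an assumption; inspect the table to see the values actually used):<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
mysql -hhallddb -ufarmer farming -e "select run,file,jobId,result,error from <project_name>Job where result != 'SUCCESS'"<br />
</pre><br />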
<br />
=== Handy mysql Instructions ===<br />
<br />
* Some commonly used mysql commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
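To restore from such a backup, the dump file (name hypothetical, whatever mysqldump produced) is simply fed back to mysql; as noted above, this drops and recreates the existing tables, so double-check the file first:<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />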
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit some jobs multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
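For reference, a hypothetical cron_plugins file might contain a single crontab line such as the following; the schedule, project name, and maximum file number are all illustrative and should be set to match the current launch:<br />
<pre><br />
# run exec.sh periodically: arguments are the project name and the maximum file number per run<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />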
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job. The cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
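Both files contain one entry per line; the entries below are purely hypothetical illustrations of the two formats:<br />
<pre><br />
# histograms_to_monitor: a histogram name or its full ROOT path<br />
FCAL/fcal_occupancy<br />
bcal_num_events<br />
# macros_to_monitor: full path to a RootSpy macro .C file<br />
/home/gxproj1/halld/monitoring/process/macros/FCAL_occupancy.C<br />
</pre><br />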
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check the log files and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_December_9,_2015&diff=71766GlueX Offline Meeting, December 9, 20152015-12-09T18:27:06Z<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, December 9, 2015<br><br />
1:30 pm EST<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## [https://mailman.jlab.org/pipermail/halld-offline/2015-November/002175.html New sim-recon release: 1.7.0]<br />
## [https://mailman.jlab.org/pipermail/halld-offline/2015-December/002177.html CCDB release v1.06.00] (Dmitry)<br />
## [https://mailman.jlab.org/pipermail/halld-online/2015-December/000692.html JANA 0.7.4] (David)<br />
## We should expect a Software Review in Summer 2016.<br />
# Review of [[GlueX Offline Meeting, November 11, 2015#Minutes|minutes from November 11]] (all)<br />
# [https://halldweb1.jlab.org/wiki/images/0/0c/2015-12-09-offline_monitoring.pdf Offline Monitoring] (Kei)<br />
# Geant4 Update (Richard, David)<br />
# [https://mailman.jlab.org/pipermail/halld-tagger/2015-November/000565.html Update to CCDB for microscope energy] (Richard, Alex B.)<br />
# BCAL timing calibration and pion ID in simulation (Paul M., Sean, Tegan)<br />
# [https://mailman.jlab.org/pipermail/halld-offline/2015-December/002179.html C++ code analyzer + clang 3.7.0] (David)<br />
# [https://mailman.jlab.org/pipermail/halld-offline/2015-November/002172.html JANA status bits] (David)<br />
# Run number assignments for simulation for future run periods (Justin)<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
# Deleting old pull-request builds (Sean)<br />
# Running tests on pull-request builds (Nathan)<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Sim1 Conditions|Future Commissioning Simulations]] (all)<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007 .<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/ .</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=File:2015-12-09-offline_monitoring.pdf&diff=71765File:2015-12-09-offline monitoring.pdf2015-12-09T18:26:43Z<p>Kmoriya: Talk by Kei on offline monitoring for offline meeting 2015/12/09</p>
<hr />
<div>Talk by Kei on offline monitoring for offline meeting 2015/12/09</div>Kmoriya
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files.Can be due to permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources (RAM or disk space).<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly (see the example below this table).<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
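Since hdswif's RLIMIT option only adds RAM, disk-space failures must be handled with swif directly. A minimal sketch, assuming swif modify-jobs accepts a -disk attribute analogous to the -ram attribute shown above (confirm the exact option name with "swif help" before use):<br />
<pre><br />
swif modify-jobs [workflow] -disk add 10gb -problems AUGER-OVER_RLIMIT<br />
swif run -workflow [workflow] -errorlimit none<br />
</pre><br />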
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF, called swif_output_[workflow].xml, and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file. (A consolidated example of the full post-analysis sequence is given below this list.)<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online, do for example <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to have the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
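Putting these steps together, a complete post-analysis session for a hypothetical 2015-03 ver15 launch would look something like the following (all commands are taken from the steps above):<br />
<pre><br />
hdswif.py summary offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 15<br />
swif freeze offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
cp ~/halld/hdswif/swif_output_offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata.xml /group/halld/data_monitoring/swif_xml_backup/<br />
</pre><br />
(Step 3, editing the summary HTML page, is done by hand.)<br />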
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables (an example query is given at the end of this subsection).<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the tables' roles are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up (see [[#Backing Up Offline Monitoring Tables | Backing Up Offline Monitoring Tables]] below).<br />
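As an illustration of the cross-launch queries these tables enable, the following sketch compares wall times for the same run/file combinations between two launches. The column names run, file, and walltime follow the job status table described later on this page, and the table names are examples; treat this as a template rather than a tested query:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "SELECT a.run, a.file, a.walltime AS ver11_walltime, b.walltime AS ver15_walltime FROM offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob a JOIN offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob b ON a.run=b.run AND a.file=b.file"<br />
</pre><br />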
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and a brief explanation of how they work.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script and calls the other utility scripts; the utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, the scripts will add it to PYTHONPATH automatically; if ROOTSYS is not set, the scripts will abort.<br />
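For reference, adding ROOT's python bindings to PYTHONPATH by hand in csh would look like the following sketch (it assumes ROOTSYS is already set and that the PyROOT module lives under $ROOTSYS/lib, which is the usual ROOT layout):<br />
<pre><br />
# Prepend the ROOT library directory, handling the case where PYTHONPATH is unset<br />
if ( $?PYTHONPATH ) then<br />
    setenv PYTHONPATH ${ROOTSYS}/lib:${PYTHONPATH}<br />
else<br />
    setenv PYTHONPATH ${ROOTSYS}/lib<br />
endif<br />
</pre><br />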
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). Otherwise, to avoid general users adding things to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked to confirm that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in the config file name; verbosity can optionally be set<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. Parameters changed from their defaults are marked with a '*'<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the PBS farm system and shows the configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* These plots show at a glance how much each stage contributes to a job's total time<br />
|}<br />
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Different problems are frequently easier to identify by the size of the stderr files<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b> (usage sketch below the table).<br />
|}<br />
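A usage sketch for stderr_by_size.py (whether the run-period argument should be given as 2015-03 or 2015_03 should be checked against the script itself):<br />
<pre><br />
python stderr_by_size.py 2015-03 15<br />
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize<br />
</pre><br />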
<br />
=== cross_analysis scripts ===<br />
<b>Summary:</b> The script run_cross_analysis.sh is the main script. In principle, running this with <pre>./run_cross_analysis.sh [RUNPERIOD] [VERSION]</pre> should work, but it is recommended that the commands within the script<br />
be run by hand to catch any errors.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| run_cross_analysis.sh ||<br />
* Main script to call other Python scripts.<br />
* Takes in run period and version as arguments and runs cross analysis<br />
|-<br />
| create_cross_analysis_table.sh ||<br />
* Creates MySQL table for current launch that is used by other scripts<br />
* Takes run period and version as input, and creates table cross_analysis_table_[RUNPERIOD]_ver[VERSION]<br />
|-<br />
| create_stats_table_row.py ||<br />
* Create a row in the html table showing the overall statistics of final states and problems for a launch<br />
* The final states are "Success" and "Segfault". "Success" includes all jobs that had problems but still finished with Success.<br />
* The problems are "Over Limit", "Timeout", and "System". If any of these occurred for any attempt of the job they are counted.<br />
* The script creates a new html table row for the current launch. This html snippet is inserted into the web-accessible file showing the results <br />
|-<br />
| create_stats_for_each_file.py ||<br />
* Create html tables showing the status of each file against different launch versions.<br />
* Takes in run period, min version, max version and shows final result and problems for all versions in between<br />
* Different final results and problems are shown by combinations of text content, color coding of text, and background color (usage sketch below the table)<br />
|-<br />
| create_resource_correlation_plots.py ||<br />
* Create plots showing correlation of resources between different launches for each file.<br />
* Takes in run period, min version, version of interest. Points are shown only for files included in version of interest<br />
* Creates plots for CPU time, Wall time, memory, virtual memory, #events, difference in #events, time to copy input evio file, time to run plugin<br />
|}<br />
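As a usage sketch for the scripts above (the argument order follows the table descriptions, and the values are examples, not tested invocations), comparing versions 11 through 15 of the 2015-03 run period with create_stats_for_each_file.py would look like:<br />
<pre><br />
python create_stats_for_each_file.py 2015_03 11 15<br />
</pre><br />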
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and this is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash (see the example below).<br />
Once you are sure of this, you can submit all of the jobs. What remains is then the post-processing<br />
of the monitoring results, which will (among other things) put the results on the webpage for the<br />
collaboration to view, and the analysis of the launch.<br />
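For example, to submit five test jobs first (using the submit syntax shown above; the project name is the example form used throughout this section):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5<br />
</pre><br />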
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
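To restore tables from such a backup, the dump file can be replayed through mysql. This is a sketch: the actual name of the file produced by backup_tables.sh should be checked, and note again that replaying the dump drops any existing table of the same name:<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_tables_2014_10_ver17.sql<br />
</pre><br />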
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will likely submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
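For reference, a crontab entry of the kind cron_plugins contains might look like the following. This line is illustrative only (the 10-minute cadence, project name, and maximum file number are assumptions); consult the actual cron_plugins file for the real schedule and arguments:<br />
<pre><br />
# Run exec.sh every 10 minutes for the given project, up to file number 004<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 004<br />
</pre><br />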
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process. If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (hypothetical example entries are given below the list):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
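For illustration, entries are one per line. The histogram and macro names below are made up and must be replaced with real ones. histograms_to_monitor might contain:<br />
<pre><br />
hCDCOccupancy<br />
/FCAL/hEnergy<br />
</pre><br />
and macros_to_monitor might contain:<br />
<pre><br />
/home/gxproj1/halld/monitoring/macros/FCAL_occupancy.C<br />
</pre><br />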
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py (future versions of the script will streamline this part of the procedure). An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
<pre><br />
create_project.sh [project name] hd_rawdata<br />
</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre><br />
./run_processing.sh<br />
</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriya
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files.Can be due to permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files have been created and are ready to be put online. The html output capabilities of hdswif are useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online do, for example, <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to:<br />
## Add a new line to the first table containing the version number, date, and comments for the current launch (a hypothetical example row is sketched after this list)<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to use the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow, it should be "frozen" so that it cannot be accidentally altered. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we back up the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
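As an illustration of the summary-page edit in step 3, the new table row might look like the following. This is a hypothetical snippet: the actual column layout and link target should be copied from the existing [run period].html file:<br />
<pre><br />
<tr><br />
  <td><a href="2015_03/ver15/index.html">ver15</a></td>  <!-- launch version, linking to its page --><br />
  <td>2015/12/11</td>                                    <!-- launch date --><br />
  <td>Full launch over RunPeriod-2015-03</td>            <!-- comments --><br />
</tr><br />
</pre><br />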
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and to run queries across tables (an example query is sketched at the end of this section).<br />
<br />
# The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj . Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# The naming scheme of the tables and their roles is the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up (see [[#Backing Up Offline Monitoring Tables | Backing Up Offline Monitoring Tables]] below).<br />
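As a concrete illustration of a cross-launch query, the following compares wall times for the same run/file combinations between two launches. This is a minimal sketch: the column names (run, file, walltime) are taken from the status queries shown later on this page, and the ver15/ver16 table names are placeholders for two actual launch tables:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "SELECT a.run, a.file, a.walltime AS ver15_walltime, b.walltime AS ver16_walltime FROM offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob a JOIN offline_monitoring_RunPeriod2015_03_ver16_hd_rawdataJob b ON a.run = b.run AND a.file = b.file ORDER BY a.run, a.file"<br />
</pre><br />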
<br />
== Explanation of Scripts ==<br />
Below is an explanation of each script used in the offline monitoring system and a brief description of how it works.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script that calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include ROOTSYS. If ROOTSYS has been set, the scripts will add it to PYTHONPATH automatically; if ROOTSYS is not set, the scripts will abort (see the sketch after the table below).<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). Otherwise, to avoid general users adding files to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in the config file name; a verbose flag can optionally be set<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from their defaults, a '*' is printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in the workflow name, run, and file. Run and file must be numbers; no wildcards<br />
* Finds the ids of the jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from the pbs farm system and shows the configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Creates html table showing results of jobs by resources requested for that job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* The plots show at a glance how much each stage contributes to each job's total time<br />
|}<br />
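If PyROOT import errors occur when running these scripts standalone, PYTHONPATH can be set by hand. A minimal sketch in csh (matching the gxprojN login shells), assuming a standard ROOT installation where the PyROOT module lives under $ROOTSYS/lib:<br />
<pre><br />
# Prepend the ROOT python bindings to PYTHONPATH (assumes PyROOT is in $ROOTSYS/lib)<br />
if ( $?PYTHONPATH ) then<br />
    setenv PYTHONPATH ${ROOTSYS}/lib:${PYTHONPATH}<br />
else<br />
    setenv PYTHONPATH ${ROOTSYS}/lib<br />
endif<br />
</pre><br />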
<br />
=== Utility scripts ===<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| stderr_by_size.py ||<br />
* Independent of all other scripts in hdswif directory<br />
* When diagnosing problems it is useful to check the stderr/stdout files. Different problems are frequently easier to find when the stderr files are grouped by size<br />
* Takes in run period and version as arguments, creates a directory /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VER]/log/bysize that contains soft links to all stdout and stderr files from the specified launch, <b>in separate directories given by stderr file size</b>.<br />
|}<br />
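A hedged usage example, assuming the run period and version are passed as positional arguments in the same convention as the other scripts on this page:<br />
<pre><br />
# Build the by-size link directories for the 2015-03 ver15 launch, then inspect them<br />
python stderr_by_size.py 2015_03 15<br />
ls /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/log/bysize<br />
</pre><br />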
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
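The three tables can be inspected directly in the farming database, for example (using the access conventions shown later in this section):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "describe <project_name>"      # job management table<br />
mysql -hhallddb -ufarmer farming -e "describe <project_name>Job"   # job status table<br />
mysql -hhallddb -ufarmer farming -e "describe <project_name>_aux"  # job metrics table<br />
</pre><br />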
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can submit all of the jobs. The remaining steps are then the post-processing<br />
(which, among other things, puts the results on the web page for the collaboration to view) and<br />
the analysis of the launch.<br />
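For example, a small test submission might look like the following (run 3180 is just an illustrative run number):<br />
<pre><br />
# Submit at most 5 jobs, restricted to run 3180<br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5 3180<br />
</pre><br />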
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to write out a file that can be executed to recreate the tables.<br />
Since executing this output file will first drop any existing table of the same name, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
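To restore, the dump file is simply fed back to mysql. A minimal sketch, where [backup output file] stands for whatever file backup_tables.sh produced; note again that this drops any existing tables of the same name:<br />
<pre><br />
mysql -hhallddb -ufarmer farming < [backup output file]<br />
</pre><br />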
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will likely submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
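For reference, a hypothetical cron_plugins file might contain a single crontab line like the following; the 10-minute cadence, project name, and maximum file number (9) are illustrative only:<br />
<pre><br />
# min hour day month weekday command<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 9<br />
</pre><br />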
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job. The cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (a hypothetical example follows the list):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
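As a purely hypothetical illustration (these entries are not taken from the actual files), each file contains one entry per line:<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path, one per line (hypothetical entries)<br />
FCAL_occupancy<br />
/highlevel/EventInfo<br />
<br />
# macros_to_monitor: full path to each RootSpy macro .C file (hypothetical entry)<br />
/home/gxproj1/halld/monitoring/macros/CDC_occupancy.C<br />
</pre><br />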
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been processed before.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check the log files and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
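A quick way to scan the logs for problems (an illustrative command, not part of the official scripts):<br />
<pre><br />
# List log files containing the word "error" (case-insensitive)<br />
grep -il error $HOME/halld/monitoring/process/log/*<br />
</pre><br />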
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Setup the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files.Can be due to permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif is useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online do for example <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for is offline monitoring https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy and paste, modify the previous launch's link to have the correct launch ver.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, there are two tables needed, and will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of tables and their roles are the same as from the jproj only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up (see the section [[#Backing_Up_Offline_Monitoring_Tables | Backing Up Offline Monitoring Tables]] below).<br />
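As an example of the kind of cross-launch query this enables (a sketch; the table names are illustrative, and the run, file, and walltime columns are assumed to be present as in the Job tables described below), the following compares wall times for the same run/file across two launches:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "SELECT a.run, a.file, a.walltime AS ver11_walltime, b.walltime AS ver15_walltime FROM offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob a JOIN offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob b ON a.run = b.run AND a.file = b.file;"<br />
</pre><br />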
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and a brief description of how they work.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script and calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include the ROOT library directory. If ROOTSYS has been set, the script will add it to PYTHONPATH automatically; if ROOTSYS is not set, the scripts will abort.<br />
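A minimal sketch of the equivalent manual setup in csh (assuming ROOTSYS and PYTHONPATH are already defined):<br />
<pre><br />
# make PyROOT importable; hdswif.py performs the equivalent step itself when ROOTSYS is set<br />
setenv PYTHONPATH ${ROOTSYS}/lib:${PYTHONPATH}<br />
</pre><br />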
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Called by the "create" option of hdswif.py<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding files to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
| read_config.py ||<br />
* Called by the "add" option of hdswif.py<br />
* Takes in config file name, optionally set verbose<br />
* Returns a dictionary mapping config parameter names to values (e.g., 'PROJECT' : 'gluex', 'NCORES' : 6, ...)<br />
* Prints the config parameters at the end. For parameters changed from their defaults, a '*' is printed<br />
|-<br />
| output_job_details.py ||<br />
* Called by the "details" option of hdswif.py<br />
* Takes in workflow name, run and file. Run and file must be numbers, no wildcards<br />
* Finds ids for jobs specified by the run and file number and returns info on each one<br />
* Job info is retrieved from pbs farm system and shows configuration parameters for that job<br />
|-<br />
| results_by_resources.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF<br />
* Creates an html table showing the results of jobs by the resources requested for each job<br />
* This table is shown under "Status by Resources" in the output html file of hdswif.py summary [workflow]<br />
|-<br />
| create_ordered_hists.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots are dependency time and pending time of each job, ordered by submission.<br />
* Different colors represent jobs submitted at different times. For jobs submitted at the same time, jobs are ordered in increasing time.<br />
|-<br />
| create_stacked_times.py ||<br />
* Called within parse_swif.py<br />
* Takes in XML output from SWIF and creates 2 plots<br />
* Plots show total job time divided into colors for different stages<br />
* One shows all jobs in order of Auger ID (roughly submission order), the other in order of total job time<br />
* The plots show at a glance how much each stage contributes to each job's total time<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section gives instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs as well as for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and the parsing is done within the analysis directory of each launch.<br />
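To quickly check which of these tables exist for a given project (project name illustrative):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "show tables like 'offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata%';"<br />
</pre><br />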
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata; this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created; its files are copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
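For example, to submit at most 5 test jobs from run 3180 (numbers illustrative):<br />
<pre>jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5 3180</pre><br />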
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining tasks are then the monitoring<br />
post-processing, which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
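Internally this amounts to something like the following for each of the three tables (a sketch; the exact mysqldump options used by the script may differ):<br />
<pre>mysqldump -hhallddb -ufarmer farming offline_monitoring_RunPeriod2014_10_ver17_hd_rawdataJob > backup_2014_10_ver17_Job.sql</pre><br />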
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will likely end up submitting duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
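A sketch of what a cron_plugins entry might look like (the schedule and arguments are illustrative; check the actual file before installing it):<br />
<pre><br />
# every 10 minutes: update the project table and submit jobs for new files (max file number 9)<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 9<br />
</pre><br />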
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. The python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
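A sketch of the list formats (the entries are hypothetical, and one entry per line is assumed):<br />
<pre><br />
# histograms_to_monitor: bare name or full ROOT path<br />
NumReconstructedObjects<br />
/Independent/Hist_Reconstruction/NumReconstructedObjects<br />
# macros_to_monitor: full path to the RootSpy macro<br />
/group/halld/data_monitoring/macros/HistMacro_Tracking.C<br />
</pre><br />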
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
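For example, to reprocess everything regardless of the bookkeeping (option placement assumed; see the script's help for the exact interface):<br />
<pre>./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02</pre><br />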
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71720Data Monitoring Procedures2015-12-05T00:16:13Z<p>Kmoriya: </p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods are written as 201Y-MM (e.g., 2015-03) and launch versions as verVV (e.g., ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, which allows everybody to see the improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system based on Mark Ito's jproj scripts<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, each package should be checked out; all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
1. Set up the environment: <pre>source ~/env_monitoring_launch</pre><br />
<br />
2. Updating & building hdds: <br />
<pre><br />
cd $HDDS_HOME<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
3. Updating & building sim-recon: <br />
<pre><br />
cd $HALLD_HOME/src<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
<br />
4. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br />
<pre><br />
cd $GLUEX_MYTOP/../sqlite/<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file<br />
</pre><br />
<br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the version control information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
1. Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
cd hdswif<br />
</pre><br />
<br />
2. Edit the job config file, which is used to register jobs in hdswif. A typical config file will look like this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables being referenced within a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
<br />
3. Creating the workflow: Within SWIF, jobs are registered into workflows. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows, and hdswif provides wrappers for most simple SWIF commands. <br />
swif list<br />
<br />
For creation of workflows for offline monitoring the command:<br />
hdswif.py create [workflow] -c input.config<br />
should be used. When a config file (here: input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example:<br />
/group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
/group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
<br />
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than with a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
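To confirm that the tag was created (standard git):<br />
<pre>git tag -l 'offmon-*'</pre><br />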
<br />
4. Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. <br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
5. Running the workflow: To run the workflow, simply use the hdswif wrapper:<br />
hdswif.py run [workflow]<br />
<br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (less than a few kB)?<br />
* Check stdout files. Are they very large (more than a few MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
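One way to spot-check the outputs of the test jobs from the command line (a sketch; the directory layout under OUTPUT_TOPDIR and the file suffixes are assumed):<br />
<pre><br />
# flag stderr files that are suspiciously large and ROOT files that are suspiciously small<br />
find /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV -name '*.err' -size +10k<br />
find /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV -name '*.root' -size -2M<br />
</pre><br />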
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If the requested resources are known to be correct, resubmit. Otherwise modify the job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check that the output files will exist after job execution and that the output directory exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check that the input file exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly (see the example below the table).<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
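Where the table above says to use SWIF directly (e.g., adding disk space), a sketch, assuming the -disk option of swif modify-jobs parallels the -ram option shown earlier:<br />
<pre><br />
swif modify-jobs [workflow] -disk add 10gb -problems AUGER-OVER_RLIMIT<br />
swif run -workflow [workflow] -errorlimit none<br />
</pre><br />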
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user of SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online do for example <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link so that it points to the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Now that we are finished with the SWIF workflow, it should be "frozen" so that it cannot be mistakenly altered. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should always be able to reproduce the XML output from SWIF, but we back up the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
# The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj . Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring, the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts its contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up.<br />
<br />
== Explanation of Scripts ==<br />
Below are explanations of each script used in the offline monitoring system and a brief description of how they work.<br />
<br />
=== hdswif scripts ===<br />
<b>Summary:</b> hdswif.py is the main script and calls the other utility scripts. The utility scripts can also be run standalone by giving the appropriate parameters. Visual graphics are made using the PyROOT extension of ROOT. To use this, the environment variable PYTHONPATH must include the ROOT library directory. If ROOTSYS has been set, the script will add it to PYTHONPATH automatically; if ROOTSYS is not set, the scripts will abort.<br />
<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| file name<br />
!width="800"| Description<br />
|-<br />
| hdswif.py ||<br />
Main script to control the behavior of SWIF. Most commands follow the form hdswif.py [command] [workflow] (options)<br />
|-<br />
| parse_swif.py ||<br />
Called within hdswif.py to create html output from SWIF results.<br />
|-<br />
| createXMLfiles.py ||<br />
* Creates XML files for logging information about launch. Must specify config file with option -c.<br />
* Also adds tags to git repositories of sim-recon and hdds.<br />
* For XML file creation, the file will be written out to /group/halld/data_monitoring/run_conditions/ if the user is gxprojN (used for offline monitoring). If not, to avoid general users adding files to the above directory, the output files will be written to the current directory.<br />
* To write out the versions of each package, environment variables such as HDDS_HOME will need to be set.<br />
* Versions for each software package are extracted using the directory structure, so if these are changed the scripts must change accordingly.<br />
* For each launch, <b>the output soft_comm_[RUNPERIOD]_ver[VER].xml file should be checked that all version numbers have been extracted.</b><br />
|-<br />
|}<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section gives instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs as well as for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and the parsing is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata; this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created; its files are copied from the template directory and modified to reflect the run period, the user, the directory that it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory.<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern given in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
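For example, to submit at most 5 test jobs from run 3180 (the run and project names here are purely illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />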
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining tasks are then the monitoring,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions (a fuller sketch is given after this list):<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
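For orientation, a heavily simplified sketch of what a script.sh along these lines might contain is shown below. The plugin name, file naming, and output paths are illustrative assumptions, not the actual contents of the script in svn:<br />
<pre><br />
#!/bin/tcsh<br />
# Hypothetical sketch only -- the real script.sh also handles REST output and logs.<br />
# RUN and FILE are assumed to be substituted at job submission time.<br />
source setup_jlab-2015-03.csh                 # set up the offline software environment<br />
set OUTDIR = /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15<br />
mkdir -p -m 775 $OUTDIR/rootfiles             # create output directory with group write permission<br />
# run the monitoring plugins over the cached copy of one raw data file<br />
hd_root -PPLUGINS=monitoring_hists -PNTHREADS=4 \<br />
    /cache/halld/RunPeriod-2015-03/rawdata/Run$RUN/hd_rawdata_${RUN}_${FILE}.evio<br />
mv hd_root.root $OUTDIR/rootfiles/hd_root_${RUN}_${FILE}.root<br />
</pre><br />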
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if any test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
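A few more queries in the same spirit; the column names follow the job status table above, but the specific status/result values are assumptions, so inspect a few rows first:<br />
<pre><br />
select status, count(*) from <project_name>Job group by status;   # count jobs by status<br />
select run, file from <project_name>Job where result='FAILED';    # list failed run/file combinations<br />
</pre><br />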
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
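To restore from such a backup, feed the dump file back to mysql (the dump file name below is an assumption based on typical mysqldump usage):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql   # drops and recreates the backed-up tables<br />
</pre><br />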
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else jobs will probably be submitted multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
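For reference, the cron_plugins file is an ordinary crontab file. A minimal sketch of such an entry, assuming a 10-minute schedule (as used previously for the online cron job) and a hypothetical log path:<br />
<pre><br />
# run exec.sh every 10 minutes: arguments are the project name and the maximum file number per run<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1<br />
</pre><br />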
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set; these specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
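As an illustration, histograms_to_monitor is a plain text list with one entry per line; the two entries below are hypothetical placeholders, the first a bare histogram name and the second a full ROOT path:<br />
<syntaxhighlight><br />
hCDCOccupancy<br />
/FCAL/hClusterEnergy<br />
</syntaxhighlight><br />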
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
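For example, to force reprocessing of everything from the example above (placing the flag before the positional arguments, assuming standard Python option parsing):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />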
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
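Once a version is registered, it can be looked up directly in the monitoring database. A minimal sketch, assuming the versions are stored in a table named version_info (the table name is a guess and should be checked against the actual schema):<br />
<syntaxhighlight><br />
mysql -u datmon -h hallddb.jlab.org data_monitoring \<br />
    -e "select * from version_info where run_period='RunPeriod-2014-10' and revision=8;"<br />
</syntaxhighlight><br />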
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71476Data Monitoring Procedures2015-11-19T12:47:56Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods are of the form 201Y-MM (for example 2015-03); launch versions are of the form verVV (for example ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, and the monitoring plugins, as well as a new sqlite CCDB file, will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup file ~/setup_jlab-2015-03.csh or a similar file should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
1. Setup the environment: <pre>source ~/setup_jlab-2015-03.csh</pre><br />
2. Building hdds: <br />
<pre><br />
cd ~/builds/hdds/hdds<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
3. Building sim-recon: <br />
<pre><br />
cd ~/builds/sim-recon/sim-recon/<br />
git pull<br />
cd src<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
4. Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span><br />
<pre><br />
cd ~/tmp<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ../<br />
</pre><br />
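As a quick sanity check of the new file (assuming the sqlite3 client is available on the ifarm), list its tables before pointing a launch at it:<br />
<pre><br />
sqlite3 ~/ccdb.sqlite ".tables"   # should list the ccdb tables; an empty result suggests the dump failed<br />
</pre><br />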
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows, and for most simple SWIF commands hdswif provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a bare SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT       gluex<br />
TRACK         reconstruction<br />
OS            centos65<br />
NCORES        6<br />
DISK          40<br />
RAM           8<br />
TIMELIMIT     8<br />
JOBNAMEBASE   offmon_<br />
RUNPERIOD     2015-03<br />
VERSION       15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # other variables may be referenced in a value<br />
SCRIPTFILE    /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE       /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span><br> Or equivalently, using the hdswif wrapper (which has the errorlimit set by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
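Putting the pieces together, a typical launch sequence (using the example workflow name from above) looks like the following; this is only a recap of the commands described in this section, not a substitute for the checks above:<br />
<pre><br />
hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
hdswif.py add    offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
hdswif.py run    offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10   # submit 10 test jobs first<br />
hdswif.py run    offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata      # then submit everything<br />
</pre><br />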
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission of failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs, be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added, e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online do for example <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html Edit the file to<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy and paste, modify the previous launch's link to have the correct launch ver.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
# <b>Backing up SWIF output</b><br> With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will backup the XML output just in case. Do <pre>cp ~/halld/hdswif/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ </pre><br />
<br />
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# The naming scheme and roles of these tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up, e.g., with the backup_tables.sh script described in the section on backing up offline monitoring tables.<br />
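As a sketch of such a cross-launch query (the two launch version numbers here are illustrative), one could join the Job tables of two launches on run and file and compare walltimes:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e \<br />
"select a.run, a.file, a.walltime as ver11_walltime, b.walltime as ver15_walltime<br />
 from offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob a<br />
 join offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob b<br />
 on a.run=b.run and a.file=b.file;"<br />
</pre><br />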
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file defines the cron job that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running (a sketch of such an entry is shown after this list). <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
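<br />
As referenced in the list above, an entry in cron_plugins might look like the following sketch (the 10-minute cadence matches the older setup described elsewhere on this page; the project name and file limit are examples):<br />
<pre><br />
# Every 10 minutes, update the project table and submit jobs for newly arrived files<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 004<br />
</pre><br />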
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be set correctly, such as the locations of the input/output directories.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
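<br />
Both files contain one entry per line. As an illustration (the entries below are hypothetical, and the # lines are annotation rather than actual file contents):<br />
<pre><br />
# histograms_to_monitor: histogram names or full ROOT paths<br />
NumReconstructedObjects<br />
/Independent/Hist_Reconstruction/NumReconstructedObjects<br />
# macros_to_monitor: full paths to RootSpy macro .C files<br />
/home/gxproj1/halld/monitoring/macros/CDC_occupancy.C<br />
</pre><br />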
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
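<br />
For instance, a full reprocessing might look like the following sketch (it assumes the option can simply precede the positional arguments):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />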
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_November_11,_2015&diff=71291GlueX Offline Meeting, November 11, 20152015-11-11T18:27:44Z<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, November 11, 2015<br><br />
1:30 pm EST<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## New work disk: [https://mailman.jlab.org/pipermail/halld-offline/2015-November/002162.html /work/halld2]<br />
## [https://mailman.jlab.org/mailman/private/gluex-collaboration/2015-November/004139.html Private Wiki] open for business<br />
# Review of [[GlueX Offline Meeting, October 28, 2015#Minutes|minutes from October 28]] (all)<br />
# [https://halldweb.jlab.org/wiki/images/a/ab/2015-11-11-offline_monitoring.pdf Offline Monitoring] (Kei)<br />
# Geant4 Update (Richard, David)<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Sim1 Conditions|Future Commissioning Simulations]] (all)<br />
# [https://halldweb.jlab.org/talks/2015/fetch-dist_111115.pdf Binary Distributions of GlueX Software] (Nathan)<br />
# [[Automatic Tests of GlueX Software|b1pi results review]]<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007 .<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/ .</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=File:2015-11-11-offline_monitoring.pdf&diff=71290File:2015-11-11-offline monitoring.pdf2015-11-11T18:27:25Z<p>Kmoriya: Talk by Kei on offline monitoring for offline meeting 2015/11/11</p>
<hr />
<div>Talk by Kei on offline monitoring for offline meeting 2015/11/11</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71259Data Monitoring Procedures2015-11-09T15:30:39Z<p>Kmoriya: /* Offline Monitoring: Running Over Archived Data */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Period 201Y-MM is, for example, 2015-03; launch version verVV is, for example, ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
1. Setup the environment: <pre>source ~/setup_jlab-2015-03.csh</pre><br />
2. Building hdds: <br />
<pre><br />
cd ~/builds/hdds/hdds<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
3. Building sim-recon: <br />
<pre><br />
cd ~/builds/sim-recon/sim-recon/<br />
git pull<br />
cd src<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
4. Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, create a new sqlite file and move it to that location. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span><br />
<pre><br />
cd ~/tmp<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ../<br />
</pre><br />
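<br />
Before pointing jobs at the new file, a quick sanity check with sqlite3 can confirm that it is a valid database (a minimal sketch):<br />
<pre><br />
sqlite3 ~/ccdb.sqlite ".tables" | head    # should list the CCDB table names without error<br />
</pre><br />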
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b>, with suitable replacements for the run period and version number. The command <pre>swif list</pre> will list all existing workflows (hdswif also provides a wrapper for most simple SWIF commands). For creation of workflows for offline monitoring, the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than with a SHA-1 hash. hdswif will ask if you would like to create a tag, and will execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, the config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register running over only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span><br> Or equivalently, using the hdswif wrapper (which has the errorlimit set by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (<kB)?<br />
* Check stdout files. Are they very large (>MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
The next step is to check the resource usage for the current launch and publish the results online.<br />
<br />
# <b>Create summary XML, HTML files</b><br> The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, run <pre>hdswif.py summary [workflow]</pre> This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.<br />
# <b>Publish output files online</b><br> At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any SWIF user, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the jproj scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj. For the gxprojN accounts this directory should exist as ~/halld/jproj. To publish the results online do, for example, <pre>python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18</pre> The script simply copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/<br />
# <b>Editing the summary HTML page</b><br> The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html , which has links to the summary page for each run period. The summary files are /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period].html . Edit the file to:<br />
## Add a new line to the first table which contains the version number, date, and comments for the current launch<br />
## Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to have the correct launch version.<br />
# <b>Freezing SWIF tables</b><br> Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do <pre>swif freeze [workflow]</pre><br />
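<br />
Putting these steps together, a typical post-launch sequence looks like the following sketch (the workflow name, run period, and version are examples):<br />
<pre><br />
hdswif.py summary offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata<br />
python ~/halld/jproj/projects/templates/publish_offmon_results.py 2015_03 18<br />
swif freeze offline_monitoring_RunPeriod2015_03_ver18_hd_rawdata<br />
</pre><br />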
=== Cross Analysis of Launches ===<br />
<br />
The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches.<br />
To do this it is useful to create MySQL tables that contain information on each launch, and then run queries across tables; an example query is sketched after the list below.<br />
<br />
#The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
# <b>Backing up tables</b><br> Tables created in MySQL should be backed up.<br />
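<br />
As a sketch of the kind of cross-launch query these tables enable (the table names below are examples), one can compare the wall time used for the same run/file combination in two different launches:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select a.run, a.file, a.walltime as ver11_walltime, b.walltime as ver15_walltime from offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob a join offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob b on a.run=b.run and a.file=b.file"<br />
</pre><br />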
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches.<br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can submit all of the jobs; an example is shown after this paragraph. The remaining steps are then the post-processing of the monitoring output,<br />
which will (among other things) put the results on the monitoring webpages for the collaboration to view, and<br />
the analysis of the launch statistics.<br />
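<br />
For example, using the submission syntax above, one might first submit a handful of test jobs and, after checking their output, submit the remainder:<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5   # test with five jobs first<br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit     # then submit everything remaining<br />
</pre><br />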
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file defines the cron job that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be set correctly, such as the locations of the input/output directories.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
# Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
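<br />
For example, to reprocess everything for a launch (assuming the option may be given ahead of the positional arguments):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />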
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to register a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
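<br />
The version file is a simple list of key = value pairs (see the example in the Data Versions section below). A minimal sketch of reading such a file, assuming only that format:<br />
<syntaxhighlight><br />
# Minimal sketch: parse a "key = value" version file into a dict.<br />
def parse_version_file(path):<br />
    info = {}<br />
    with open(path) as f:<br />
        for line in f:<br />
            line = line.strip()<br />
            if line and "=" in line:<br />
                key, _, value = line.partition("=")<br />
                info[key.strip()] = value.strip()<br />
    return info<br />
<br />
# e.g. parse_version_file("vers_RunPeriod-2014-10_pass1.txt")["run_period"]<br />
# -> "RunPeriod-2014-10"<br />
</syntaxhighlight><br />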
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/<br />
and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71253Data Monitoring Procedures2015-11-08T03:48:47Z<p>Kmoriya: /* Starting the Launch and Submitting Jobs */</p>
<hr />
<div>__TOC__<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we saw in the online monitoring, and also to update the results with the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh.<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from Mark Ito's jproj system<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, each package should be checked out; all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt at the latest versions.<br />
For the gxprojN user accounts, all software builds are contained in the directory ~/builds<br />
(a soft link to /work/halld/home/gxprojN/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh or a similar file should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
1. Setup the environment: <pre>source ~/setup_jlab-2015-03.csh</pre><br />
2. Building hdds: <br />
<pre><br />
cd ~/builds/hdds/hdds<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
3. Building sim-recon: <br />
<pre><br />
cd ~/builds/sim-recon/sim-recon/<br />
git pull<br />
cd src<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
4. Prepare the latest sqlite file: The sqlite file path is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]; a quick sanity check of the resulting file is sketched after these steps. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span><br />
<pre><br />
cd ~/tmp<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ../<br />
</pre><br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to extract information about the library locations automatically.<br />
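<br />
As a quick sanity check that the sqlite file was generated properly, it can be opened with python's built-in sqlite3 module; a minimal sketch:<br />
<pre><br />
# Minimal sketch: verify the generated ccdb.sqlite is a readable database.<br />
import sqlite3<br />
conn = sqlite3.connect("/home/gxproj5/ccdb.sqlite")<br />
cur = conn.cursor()<br />
# List a few tables to confirm the dump completed.<br />
cur.execute("SELECT name FROM sqlite_master WHERE type='table' LIMIT 5")<br />
print(cur.fetchall())<br />
conn.close()<br />
</pre><br />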
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows; hdswif also provides wrappers for most simple SWIF commands. <pre>swif list</pre> For creation of workflows for offline monitoring, the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than with a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Values may reference other config variables in brackets (see the sketch below)<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, they can be specified, for example, with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
which registers only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
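<br />
As seen in the OUTPUT_TOPDIR value above, a config value may reference other config variables in brackets. A minimal sketch of that kind of substitution (a simplified stand-in, not the actual hdswif code):<br />
<pre><br />
# Minimal sketch: expand [VAR] references in config values.<br />
import re<br />
<br />
def expand(config):<br />
    pattern = re.compile(r"\[([A-Z_]+)\]")<br />
    sub = lambda v: pattern.sub(lambda m: config.get(m.group(1), m.group(0)), v)<br />
    return dict((k, sub(v)) for k, v in config.items())<br />
<br />
cfg = {"RUNPERIOD": "2015-03", "VERSION": "15",<br />
       "OUTPUT_TOPDIR": "/volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION]"}<br />
print(expand(cfg)["OUTPUT_TOPDIR"])<br />
# -> /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15<br />
</pre><br />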
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span><br> Or equivalently, using the hdswif wrapper (which has the errorlimit set by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that a few jobs be tested first to make sure that everything is working, rather than failing thousands of jobs.</b><br><b>Check the configuration setup when creating a workflow, and what is in script.sh. Also check the following:<br />
* Check stderr files. Are they small (< 1 kB)?<br />
* Check stdout files. Are they very large (> 1 MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
</b><br />
<br> For this purpose, hdswif's run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
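<br />
A minimal sketch of checking the test-job outputs against the rules of thumb above (the file names are hypothetical placeholders for one job's output):<br />
<pre><br />
# Minimal sketch: flag suspicious job outputs by size (hypothetical file names).<br />
import os<br />
<br />
def check_file(path, min_bytes=None, max_bytes=None):<br />
    size = os.path.getsize(path)<br />
    if min_bytes is not None and size < min_bytes:<br />
        print("SUSPICIOUS (too small): %s (%d bytes)" % (path, size))<br />
    if max_bytes is not None and size > max_bytes:<br />
        print("SUSPICIOUS (too large): %s (%d bytes)" % (path, size))<br />
<br />
check_file("stderr.log", max_bytes=1024)                 # stderr: should be < 1 kB<br />
check_file("stdout.log", max_bytes=1024*1024)            # stdout: should not be > ~1 MB<br />
check_file("monitoring.root", min_bytes=5*1024*1024)     # ROOT file: several MB<br />
check_file("rest_output.hddm", min_bytes=20*1024*1024)   # REST file: tens of MB<br />
</pre><br />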
<br />
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> on Auger and for SWIF with <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. To resubmit failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used; for jobs to be resubmitted with more resources, use, e.g., <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct, resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if a tape file is unavailable (e.g. missing/damaged tape).<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for job add more resources.<br />
Default is to add 2 hrs of processing time. Also check whether code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output as XML and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of the tables and their roles is the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist (a parsing sketch follows this list).<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
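<br />
For reference, a minimal sketch of the kind of parsing create_jproj_job_table.py performs on the hdswif XML summary; the element and attribute names here are hypothetical placeholders, since the actual schema is defined by swif:<br />
<pre><br />
# Minimal sketch: read per-job records from the swif XML summary.<br />
# The element/attribute names ("job", "id", "state") are hypothetical.<br />
import xml.etree.ElementTree as ET<br />
<br />
tree = ET.parse("swif_output_offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata.xml")<br />
for job in tree.getroot().iter("job"):<br />
    print(job.get("id"), job.get("state"))<br />
    # ... each record would then be INSERTed into the [workflow]Job MySQL table ...<br />
</pre><br />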
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send all jobs in. The remaining steps are then the post-processing,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch statistics.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete the existing database table(s) for the current set of job submissions (if any), and create new, empty ones:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
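<br />
For illustration, cron_plugins contains standard crontab lines; the schedule and arguments below are hypothetical examples only:<br />
<pre><br />
# Hypothetical example: check for new data every 10 minutes,<br />
# for one project, up to file number 9 per run.<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />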
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
</div>Kmoriya
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below, the procedures are described for:<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh.<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, check out each package; all necessary<br />
scripts are included.<br />
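<br />
A minimal setup might look like the following (the target directory ~/halld is the usual choice for the gxprojN accounts, not a requirement):<br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
</pre><br />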
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt from the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
1. Setup the environment: <pre>source ~/setup_jlab-2015-03.csh</pre><br />
2. Building hdds: <br />
<pre><br />
cd ~/builds/hdds/hdds<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
3. Building sim-recon: <br />
<pre><br />
cd ~/builds/sim-recon/sim-recon/<br />
git pull<br />
cd src<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
4. Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span><br />
<pre><br />
cd ~/tmp<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ../<br />
</pre><br />
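A quick sanity check of the freshly created file might look like this (a sketch; the directories table assumes the standard CCDB schema):<br />
<pre><br />
# A non-empty count suggests the dump succeeded (standard CCDB schema assumed)<br />
sqlite3 ~/ccdb.sqlite "select count(*) from directories;"<br />
</pre><br />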
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, which is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds. Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows; for most simple SWIF commands, hdswif also provides a wrapper. <pre>swif list</pre> To create a workflow for offline monitoring, use <pre>hdswif.py create [workflow] -c [config file] </pre> As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than with a SHA-1 hash. hdswif will ask if you would like to create a tag, and will execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> (note that the tag must be pushed to the remote explicitly). This will only be invoked when the user name is gxprojN; for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other config variables substituted into a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span> or equivalently, using the hdswif wrapper (which has the errorlimit set by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted.<br />
<br />
<b>Checklist to make sure jobs are running correctly:</b><br />
* Check stderr files. Are they very large (more than a few kB)?<br />
* Check stdout files. Are they very large (more than a few MB)?<br />
* Check output ROOT files. Are they larger than several MB?<br />
* Check output REST files. Are they larger than several tens of MB?<br />
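One quick way to run this checklist is to scan the launch output area for undersized files, as in this sketch (the directory follows the OUTPUT_TOPDIR convention above; the size thresholds and the .hddm extension for REST files are illustrative assumptions):<br />
<pre><br />
# Flag suspiciously small output files (thresholds are illustrative)<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15 -name "*.root" -size -1M<br />
find /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15 -name "*.hddm" -size -10M<br />
</pre><br />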
To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> on Auger and for SWIF with <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. To resubmit failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used; for jobs to be resubmitted with more resources, use, e.g., <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
This only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If a number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands; for hdswif, see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If requested resources are known to be correct, resubmit. Otherwise modify job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)<br />
||<br />
Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
The default is to add 2 hrs of processing time. Also check whether the code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources, RAM or disk space.<br />
||<br />
Add more resources for job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
Output file specified by user was not found.<br />
||<br />
Check if output file exists at end of job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with non-zero status. Check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML format and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations between different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
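<br />
For a quick look at what was gathered, these tables can be queried directly, e.g. (the project name is a placeholder):<br />
<pre><br />
# Peek at the first few rows of per-job metrics culled from the log files<br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>_aux limit 5"<br />
</pre><br />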
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied and modified from the template directory to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
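For example, a small test submission restricted to one run might look like this (the project name and run number are illustrative):<br />
<pre><br />
# Submit at most 5 jobs, only for run 3180 (values are illustrative)<br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />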
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send in all jobs. The remaining steps are then the monitoring,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
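<br />
Since the dump is executable SQL that drops and recreates the tables, a backup can be restored by feeding it back to mysql (the dump file name here is hypothetical):<br />
<pre><br />
# CAUTION: this drops any existing tables of the same name before recreating them<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />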
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit a job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job; the cron job is currently run under the "gluex" account.<br />
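<br />
A crontab entry for this might look like the following sketch (the 10-minute schedule is an illustrative assumption):<br />
<pre><br />
# Run the online check every 10 minutes (schedule is illustrative)<br />
*/10 * * * * /home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</pre><br />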
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
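For example, the entries might look like the following (both names are hypothetical; the first line would go in histograms_to_monitor, the second in macros_to_monitor, one entry per line):<br />
<pre><br />
CDC/cdc_occupancy<br />
/home/gxproj1/halld/monitoring/process/macros/occupancy_cdc.C<br />
</pre><br />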
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
Example configuration parameters:<br />
<syntaxhighlight><br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
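<br />
For example, to find the most recent log and skim the logs for errors (the file layout under the log directory is an assumption):<br />
<pre><br />
ls -lt $HOME/halld/monitoring/process/log | head<br />
grep -il error $HOME/halld/monitoring/process/log/*<br />
</pre><br />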
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=71251Data Monitoring Procedures2015-11-08T02:14:47Z<p>Kmoriya: /* Checking the Status and Resubmitting */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
1. Setup the environment: <pre>source ~/setup_jlab-2015-03.csh</pre><br />
2. Building hdds: <br />
<pre><br />
cd ~/builds/hdds/hdds<br />
git pull # Get latest software<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
3. Building sim-recon: <br />
<pre><br />
cd ~/builds/sim-recon/sim-recon/<br />
git pull<br />
cd src<br />
scons -c install # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers<br />
scons install -j4 # Rebuild and re-install with 4 threads<br />
</pre><br />
4. Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span><br />
<pre><br />
cd ~/tmp<br />
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
mv ccdb.sqlite ../<br />
</pre><br />
5. Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b>, with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows, and for most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> To create a workflow for offline monitoring, the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used (a concrete example is sketched in the last sub-item below). As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find the software versions than using a SHA-1 hash. hdswif will ask if you would like to create a tag, and will execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
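#* As a concrete example of workflow creation (the workflow name here is hypothetical, following the naming convention above): <pre>hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config</pre><br />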
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config values can be referenced as [VARIABLE]<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE, OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span> or equivalently, using the hdswif wrapper (which sets the errorlimit by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested first to make sure that everything is working, rather than failing thousands of jobs.</b><br> For this purpose, hdswif's run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all remaining jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
1. The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> for Auger, and for SWIF with <pre>swif list</pre> or, for more information, <pre>swif status [workflow] -summary</pre> Note that "swif status" tends to lag behind sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
<br />
2. For failed jobs, SWIF can resubmit jobs based on the problem. To resubmit failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and to resubmit jobs with more resources, use, e.g., <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> <br />
Note that modify-jobs only re-stages the jobs; be sure to resubmit them with:<br />
<pre>swif run -workflow [workflow] -errorlimit none</pre><br />
<br />
hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified, by default with 2 additional hours or GB of RAM. If an additional number is given as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
<br />
3. For information on swif, use the "swif help" commands, and for hdswif see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
4. Below is a table describing the various errors that can occur.<br />
{| border="1" cellpadding="0" valign="left" style="text-align: left;"<br />
!width="150"| ERROR NAME<br />
!width="400"| Description<br />
!width="400"| Resolution<br />
!width="400"| hdswif command<br />
|-<br />
| AUGER-SUBMIT<br />
||<br />
SWIF’s attempt to submit jobs to Auger failed. This includes server-side problems as well as the user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)<br />
||<br />
If the requested resources are known to be correct, resubmit. Otherwise, modify the job resources using swif directly.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-FAILED<br />
||<br />
Auger reports the job FAILED with no specific details.<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-OUTPUT-FAIL<br />
||<br />
Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.<br />
||<br />
Check that the output files will exist after job execution and that the output directory exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-INPUT-FAIL<br />
||<br />
Auger failed to copy one or more of the requested input files, similar to the output failures. This can also happen if a tape file is unavailable (e.g. missing/damaged tape).<br />
||<br />
Check that the input file exists, then resubmit the jobs. If the problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|-<br />
| AUGER-TIMEOUT<br />
||<br />
Job timed out.<br />
||<br />
If more time is needed for the job, add more resources.<br />
The default is to add 2 hrs of processing time. Also check whether the code is hanging.<br />
||<br />
<span style="color:red"><b>hdswif.py resubmit [workflow] TIMEOUT</b></span><br><br />
Default is to add 2 hours. Optionally specify number of hours at end.<br />
|-<br />
| AUGER-OVER_RLIMIT<br />
||<br />
Not enough resources (RAM or disk space).<br />
||<br />
Add more resources for the job.<br />
||<br />
<span style="color:blue"><b>hdswif.py resubmit [workflow] RLIMIT</b></span><br><br />
Default is to add 2 GB of RAM. Optionally specify GB at end. To add more disk space use SWIF directly.<br />
|-<br />
| SWIF-MISSING-OUTPUT<br />
||<br />
An output file specified by the user was not found.<br />
||<br />
Check that the output file exists at the end of the job.<br />
||<br />
<br />
|-<br />
| SWIF-USER-NON-ZERO <br />
||<br />
User script exited with non-zero status code.<br />
||<br />
Your script exited with a non-zero status; check the code you are running.<br />
||<br />
<br />
|-<br />
| SWIF-SYSTEM-ERROR <br />
||<br />
Job failed owing to a problem with swif (e.g. network connection timeout)<br />
||<br />
Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.<br />
||<br />
<span style="color:purple"><b>hdswif.py resubmit [workflow] SYSTEM</b></span><br />
|}<br />
<br style="clear:both;"/><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite it.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to switch to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj . Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring, the directory should be ~/halld/jproj.<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory <pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch: <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (the same as the workflow name). The script uses the template files in the templates directory and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created by hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for the arguments can be found within run_analysis.sh (the same holds for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations between different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of the tables and their roles is the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it ran on, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts its contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
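A quick way to see how many jobs are in each state is a GROUP BY query on the Job table (a sketch; the table name follows the naming convention above, and the status column is the one shown in the job status queries below):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select status, count(*) from offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob group by status"<br />
</pre><br />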
<br />
== Hall D Job Management System ==<br />
<br />
This section details how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
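Since all three tables are keyed by job, the status and metrics tables can be joined to correlate, e.g., where a job ran with how long its plugin took. A minimal sketch (the jobId column name is taken from the job status query shown below; a.* is used to avoid assuming the _aux column names):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select j.jobId, j.status, a.* from <project_name>Job j join <project_name>_aux a on j.jobId = a.jobId"<br />
</pre><br />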
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts: <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all of the scripts needed for launching. Once checked out, do <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the form given in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options, all files that are registered and have not yet been submitted will be submitted. See the example below.<br />
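For example, to submit at most 5 test jobs for run 3180 (the project name and numbers here are illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />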
<br />
At this stage you are ready to submit all jobs. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this is the case, you can send all jobs in. The remaining steps are then the post-processing<br />
of the monitoring, which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The XML job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Some handy mysql commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh, which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage, to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
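Since the dump is ordinary SQL, the tables can be restored by feeding the output file back to mysql (a sketch; the dump file name here is hypothetical, and remember that this drops any existing tables of the same names):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />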
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cron job that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
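As a sketch of what the crontab file might contain (the schedule, project name, and maximum file number here are all hypothetical; consult the actual cron_plugins file for the real values):<br />
<pre><br />
# Check for new rawdata files on /mss once per hour and submit jobs for them<br />
0 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 20<br />
</pre><br />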
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job; the cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. The python script automatically checks for new ROOT files, which it then automatically processes. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
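As noted above, the shell wrapper is suitable for a cron job. A hypothetical crontab entry (the schedule here is illustrative) might look like:<br />
<pre><br />
# Look for new online monitoring ROOT files every 10 minutes<br />
*/10 * * * * /home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</pre><br />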
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or from multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (a sketch of the file formats follows the list):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
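As a sketch of the formats (one entry per line; the histogram and macro names below are purely illustrative), histograms_to_monitor might contain<br />
<pre><br />
FCAL/fcal_num_hits<br />
NumTriggers<br />
</pre><br />
and macros_to_monitor might contain<br />
<pre><br />
/home/gxproj1/halld/monitoring/process/macros/occupancy.C<br />
</pre><br />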
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
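For example, to reprocess all files for a run period (paths and date as in the example above; the option placement is a sketch):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />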
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, for the sake of reproducibility and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Talks&diff=70933GlueX Talks2015-11-02T20:45:13Z<p>Kmoriya: /* Talks in 2014 */</p>
<hr />
<div>== Conference and Workshop Talks==<br />
<br />
=== Talks in 2015 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2872 A first look at reconstructed data from the GlueX detector] DNP15, October 28-31, 2015, Santa Fe, NM, presented by Simon Taylor.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2687 Analysis Plans in GlueX] Athos/PWA Meeting, April 13-17 2015, GWU, presented by Curtis A. Meyer.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2745 Light-quark meson spectroscopy with the GlueX experiment] CIPANP May 2015, presented by Mark M. Dalton.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2795 Electromagnetic Production of Strangeness at Jefferson Lab] [http://lambda.phys.tohoku.ac.jp/hyp2015/ HYP2015], September 2015, presented by Kei Moriya<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2799 First GlueX Results] Hadron 2015, September 2015, presented by Curtis A. Meyer.<br />
<br />
=== Talks in 2014 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2506 Physics in Hall D] Jefferson Lab 2014 Users Group Meeting, June 2-4, 2014, presented by Curtis A. Meyer.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2534 The GlueX Experiment at Jefferson Lab], HaPhy-CLAS Workshop on Hadron Productions, August 2014, Kei Moriya<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2532 The GlueX experiment and the search for exotic mesons], PANIC 2014, August 25-29, Will Levine<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2541 Electromagnetic Strangeness Production at Jefferson Lab Energies,] XI Conference on Quark Confinement at the Hadron Spectrum, September 8-12, 2014, presented by R. A. Schumacher.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2603 Eta Decay Program at GlueX], MesonNet Meeting, INFN Frascati, September 29, Alexander Somov <br />
* [[DNP-2014 abstract submitted by Richard Jones|Abstract submitted to Mini-symposium on Hybrid Mesons and Molecules]], DNP-2014, Oct. 7-12, 2014, by Richard Jones<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2689 GlueX Analysis] presented at the Future Directions in Partial Wave Analysis Workshop at Jefferson Lab, November 2014.<br />
<br />
=== Talks in 2013 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2375 Hybrid Meson Lectures] presented by Curtis Meyer in December 2013.<br />
* Carnegie Mellon University undergraduate colloquium, November 2013. [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2373 Big Science] a talk on how large science projects are conceived, funded and built following the 12-GeV upgrade as an example (Curtis A. Meyer).<br />
* [https://www.jlab.org/conferences/dnp2013/dnp-13.html APS-DNP 2013] October 23-26, 2013, Newport News, VA<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2355 Characteristics of Silicon Photomultipliers (SiPM) for GlueX], Yi Qiang<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2546 Reconstruction of showers in the GlueX barrel calorimeter], Will Levine<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2674 JLab Hall-D Photon Beamline], Alexander Somov<br />
* [http://www.int.washington.edu/NNPSS/2013/HOME.html Nuclear Science Summer School] July 15-26, 2013, Stony Brook University campus. [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2278 Colloquium] presented by Curtis A. Meyer.<br />
* [http://www.inpc2013.it INPC2013] 25th International Nuclear Physics Conference June 2-7, 2013 Florence, Italy<br />
* [https://hep.ustc.edu.cn/indico/conferenceOtherViews.py?view=standard&confId=1 The Fifth Workshop on Hadron Physics in China and Opportunities in US], July 2-6, Huangshan, China<br />
** [http://argus.phys.uregina.ca/gluex/DocDB/0022/002282/001/20130703_Physics_HallD.pdf Physics Program at Jefferson Lab Hall-D], Yi Qiang<br />
* [http://www.aps.org/meetings/april/index.cfm APS] American Physical Society Meeting, April Mtg., Apr. 13-16, 2013, Denver, CO<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2283 Characteristics of S12045(X) photon sensors for GlueX], Elton Smith<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2284 The Barrel Calorimeter for the GlueX Experiment at Jefferson Lab], Zisis Papandreou (presented by Elton Smith)<br />
* [https://sites.google.com/site/ghpworkshop/ GHP Workshop] Meeting of the APS Topical Group on Hadron Physics (GHP), Apr. 10-12, 2013, Denver, CO<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2189 Exploration of Exotic Mesons with GlueX], Elton Smith<br />
<br />
=== Talks in 2012 ===<br />
<br />
* PANDA XLIII Collaboration Meeting, Dec 10-14, 2012 GSI, Germany<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2122 GlueX: Photoproduction of Hybrid Mesons], Elton Smith<br />
* Spectroscopy at 12 GeV Workshop, Jefferson Lab<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2192 Current PWA Projects in Hall D], Matt Shepherd<br />
*[http://www.aps.org/units/dnp/meetings/meeting.cfm?name=DNP12/ DNP 2012] 2012 Fall Meeting of the APS Division of Nuclear Physics, October 24 - October 27, 2012, Newport Beach, California<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2110 GlueX: Thin Diamond Radiators for the GlueX Experiment], Brendan Pratt<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2111 Collimation and Tagging Instrumentation for the GlueX Photon Beamline], R.T. Jones<br />
* PANDA XLII Collaboration Meeting, June 26-29, 2012 Evanston, IL<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2193 GlueX: A search for light quark exotic mesons at Jefferson Lab], Matt Shepherd<br />
* [http://www.ge.infn.it/~athos12/ATHOS/Welcome.html ATHOS 2012] International Workshop on partial wave analysis, June 20-23, 2012. <br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2029 Talk] presented by Curtis Meyer.<br />
* [http://www.cap.ca/en/congress/2012 CAP Congress 2012], University of Calgary (Calgary, Alberta), June 11-15, 2012<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2026 The GlueX Large Area Silicon Photomultipliers], Z. Papandreou<br />
*[http://www.jlab.org/conferences/ugm/program.htm Jefferson Lab User's Meeting], June 4-6, 2012.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2025 Talk] presented by Curtis Meyer.<br />
*[http://meson.if.uj.edu.pl/ MESON 2012] 12th International Workshop on Meson Production, Properties and Interaction, May 31 - Jun 5, 2012 Krakow, Poland<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2016 GlueX: Photoproduction of Hybrid Mesons], Elton Smith<br />
*[http://cipanp2012.triumf.ca/ CIPANP 2012] Eleventh Conference on the Intersections of Particle and Nuclear Physics, May 29 - Jun 3, 2012, St. Petersburg, Florida<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2022 GlueX: Neutron Radiation Hardness of SiPM and Its Applications in Jefferson Lab Hall-D], Yi Qiang<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2675 Development of Level-1 Triggers for Experiments at Jefferson Lab], Alexander Somov<br />
<br />
=== Talks in 2011 ===<br />
*[http://web.mit.edu/panic11/ PANIC 11] The 19th Particles and Nuclei International Conference, in Cambridge, Massachusetts at the Massachusetts Institute of Technology (MIT)<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1792 GlueX: Detector Construction and Event Simulations], Naomi Jarvis.<br />
* [http://www.hadron2011.de/ Hadron 2011], Munich, Germany, June 13-17, 2011<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1780 Search for Gluonic Excitations in Hadrons with GlueX], I. Senderovich<br />
* [http://www.cap.ca/en/congress/2011 CAP Congress 2011], Memorial University of Newfoundland (St. John's, Newfoundland), June 13-17, 2011<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2123 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou<br />
* [http://www.aps.org/meetings/april/info/index.cfm APS Physics - April Meeting], Hyatt Regency Garden Grove, Anaheim, CA, April 30 - May 1, 2011.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1729 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1731 The Meson Spectrum from Lattice QCD], J. Dudek.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1734 The Experimental Spectrum of Hadrons], M. Shepherd.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1767 Overview of GlueX Offline Computing], M. Ito.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1727 Charged particle tracking for the gluex detector], S. Taylor.<br />
* [https://sites.google.com/site/ghpworkshop/home The 4th Workshop of the APS Topical Group on Hadronic Physics], Hyatt Regency Garden Grove, Anaheim, CA, April 27-29, 2011.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1730 The Meson Spectrum from Lattice QCD], J. Dudek.<br />
* [http://www1.jlab.org/ul/calendar/calendar_date.cfm?date=23&month=2&year=2011 Workshop on Excited Hadronic States and the Deconfinement Transition], Jefferson Lab, Newport News, VA, February 23-25, 2011.<br />
** [http://www.curtismeyer.com/talks/C_Meyer_Hadron_Spectroscopy.pptx Spectroscopy: experimental status and prospects], Curtis A. Meyer.<br />
* [http://www.sfu.ca/~caa12/WNPPC11/ WNPPC 2011], Winter Nuclear and Particle Physics Conference, 18-20 Feb 2011, Banff, Canada<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1711 The GlueX Barrel Calorimeter], Zisis Papandreou<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1712 Large Area Multi-Pixel Photon Detectors for the GlueX Barrel Calorimeter], Mehrnoosh Tahani<br />
* [http://www.gsi.de/forschung/kp/had2/meeting/Hirschegg_2011.html Hirschegg 2011], The Structure and Dynamics of Hadrons, Hirschegg, Austria, January 16-22, 2011<br />
** [http://argus.phys.uregina.ca/gluex/DocDB/0017/001706/001/hirschegg_gluex.pdf The GlueX Experiment (and its Context)], Ryan Mitchell<br />
<br />
=== Talks in 2010 ===<br />
* [http://www.lanl.gov/dnp DNP 2010], Santa Fe, NM, Nov 2 - 6.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1639 Level-1 Trigger of the GlueX Experiment], Alexander Somov<br />
* Jefferson Lab PAC36, Jefferson Lab, Newport News, VA, August, 2010.<br />
** [http://www.curtismeyer.com/talks/GlueX_Pac_Presentation.pptx GlueX Presentation], Curtis A. Meyer.<br />
* [http://www.lnf.infn.it/public/ LNF Frascati], Frascati, Italy, June 30, 2010<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2124 The GlueX Experiment: construction is under way], Z. Papandreou<br />
* [http://www.cap.ca/en/congress/2010 CAP Congress 2010], University of Toronto (Toronto, Ontario), June 7-11, 2010<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2123 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou<br />
* Jefferson Lab Users Group Meeting, Jefferson Lab, Newport News, VA, June 7-9, 2010. <br />
** [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=1544 GlueX/Hall-D Physics], Curtis A. Meyer.<br />
* [https://www.jlab.org/conferences/MENU10 Meson-Nucleon Physics and the Structure of the Nucleon (MENU 2010)], Williamsburg, VA, May 31, 2010.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1540 Physics Prospects with GlueX], Alexander Somov<br />
<br />
== Seminar Talks ==<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2432 GlueX Program at Hall-D], Pizza seminar at Jefferson Lab, Newport News, Feb 26, 2014, Yi Qiang<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2288 Physics Program at Jefferson Lab Hall-D], Seminar talk at Argonne National Lab, Chicago, August 2013, Yi Qiang<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1957 The Search for Gluonic Excitations in Light Mesons with the GlueX Experiment], Seminar Talk at CPHT, Ecole Polytechnique, France, April 2012, Igor Senderovich<br />
<br />
== Colloquium Talks ==<br />
* [http://www.curtismeyer.com/talks/UTK_Sep_11_Colloquium.pptx The Jefferson Lab 12-GeV Upgrade and the GlueX Experiment], Colloquium at The University of Tennessee, Knoxville, September 2011, Curtis A. Meyer.<br />
<br />
* [http://www.curtismeyer.com/talks/ASU_Colloquium.pptx Quarks, QCD and Confinement: What we hope to learn at Jefferson Lab], Colloquium at ASU, February 2010, Curtis Meyer.<br />
<br />
* [http://www.curtismeyer.com/talks/StVincent2.pptx Gluonic Hadrons as a Probe of Confinement], Undergraduate Colloquium at Saint Vincent College, Latrobe PA, September 2009, Curtis Meyer.</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Talks&diff=70931GlueX Talks2015-11-02T20:42:20Z<p>Kmoriya: /* Talks in 2015 */</p>
<hr />
<div>== Conference and Workshop Talks==<br />
<br />
=== Talks in 2015 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2687 Analysis Plans in GlueX] Athos/PWA Meeting, April 13-17 2015, GWU, presented by Curtis A. Meyer.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2745 Light-quark meson spectroscopy with the GlueX experiment] CIPANP May 2015, presented by Mark M. Dalton.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2795 Electromagnetic Production of Strangeness at Jefferson Lab] [http://lambda.phys.tohoku.ac.jp/hyp2015/ HYP2015], September 2015, presented by Kei Moriya<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2799 First GlueX Results] Hadron 2015, September 2015, presented by Curtis A. Meyer.<br />
<br />
=== Talks in 2014 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2506 Physics in Hall D] Jefferson Lab 2014 Users Group Meeting, June 2-4, 2014, presented by Curtis A. Meyer.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2532 The GlueX experiment and the search for exotic mesons], PANIC 2014, August 25-29, Will Levine<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2541 Electromagnetic Strangeness Production at Jefferson Lab Energies,] XI Conference on Quark Confinement at the Hadron Spectrum, September 8-12, 2014, presented by R. A. Schumacher.<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2603 Eta Decay Programm at GlueX], MesonNet Meeting, INFN Frascati, September 29, Alexander Somov <br />
* [[DNP-2014 abstract submitted by Richard Jones|Abstract submitted to Mini-symposium on Hybrid Mesons and Molecules]], DNP-2014, Oct. 7-12, 2014, by Richard Jones<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2689 GlueX Analysis] presented at the Future Directions in Partial Wave Analysis Workshop at Jefferson Lab, November 2014.<br />
<br />
=== Talks in 2013 ===<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2375 Hybrid Meson Lectures] presented by Curtis Meyer in December 2013.<br />
* Carnegie Mellon University undergraduate colloquium, November 2013. [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2373 Big Science] a talk on how large science projects are conceived, funded and built following the 12-GeV upgrade as an example (Curtis A. Meyer).<br />
* [https://www.jlab.org/conferences/dnp2013/dnp-13.html APS-DNP 2013] October 23-26, 2013, Newport News, VA<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2355 Characteristics of Silicon Photomultipliers (SiPM) for GlueX], Yi Qiang<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2546 Reconstruction of showers in the GlueX barrel calorimeter], Will Levine<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2674 JLab Hall-D Photon Beamline], Alexander Somov<br />
* [http://www.int.washington.edu/NNPSS/2013/HOME.html Nuclear Science Summer School] July 15-26, 2013, Stony Brook University campus. [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2278 Colloquium] presented by Curtis A. Meyer.<br />
* [http://www.inpc2013.it INPC2013] 25th International Nuclear Physics Conference June 2-7, 2013 Florence, Italy<br />
* [https://hep.ustc.edu.cn/indico/conferenceOtherViews.py?view=standard&confId=1 The Fifth Workshop on Hadron Physics in China and Opportunities in US], July 2-6, Huangshan, China<br />
** [http://argus.phys.uregina.ca/gluex/DocDB/0022/002282/001/20130703_Physics_HallD.pdf Physics Program at Jefferson Lab Hall-D], Yi Qiang<br />
* [http://www.aps.org/meetings/april/index.cfm APS] American Physical Society Meeting, April Mtg., Apr. 13-16, 2013, Denver, CO<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2283 Characteristics of S12045(X) photon sensors for GlueX], Elton Smith<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2284 The Barrel Calorimeter for the GlueX Experiment at Jefferson Lab], Zisis Papandreou (presented by Elton Smith)<br />
* [https://sites.google.com/site/ghpworkshop/ GHP Workshop] Meeting of the APS Topical Group on Hadron Physics (GHP), Apr. 10-12, 2013, Denver, CO<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2189 Exploration of Exotic Mesons with GlueX], Elton Smith<br />
<br />
=== Talks in 2012 ===<br />
<br />
* PANDA XLIII Collaboration Meeting, Dec 10-14, 2012 GSI, Germany<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2122 GlueX: Photoproduction of Hybrid Mesons], Elton Smith<br />
* Spectroscopy at 12 GeV Workshop, Jefferson Lab<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2192 Current PWA Projects in Hall D], Matt Shepherd<br />
*[http://www.aps.org/units/dnp/meetings/meeting.cfm?name=DNP12/ DNP 2012] 2012 Fall Meeting of the APS Division of Nuclear Physics, October 24 - Ocober 27, 2012 Newport Beach, California<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2110 GlueX: Thin Diamond Radiators for the GlueX Experiment], Brendan Pratt<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2111 Collimation and Tagging Instrumentation for the GlueX Photon Beamline], R.T. Jones<br />
* PANDA XLII Collaboration Meeting, June 26-29, 2012 Evanston, IL<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2193 GlueX: A search for light quark exotic mesons at Jefferson Lab], Matt Shepherd<br />
* [http://www.ge.infn.it/~athos12/ATHOS/Welcome.html ATHOS 2012] International Workshop on partial wave analysis, June 20-23, 2012. <br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2029 Talk] presented by Curtis Meyer.<br />
* [http://www.cap.ca/en/congress/2012 CAP Congress 2012], University of Calgary (Calgary, Alberta), June 11-15, 2012<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2026 The GlueX Large Area Silicon Photomultipliers], Z. Papandreou<br />
*[http://www.jlab.org/conferences/ugm/program.htm Jefferson Lab User's Meeting], June 4-6, 2012.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2025 Talk] presented by Curtis Meyer.<br />
*[http://meson.if.uj.edu.pl/ MESON 2012] 12th International Workshop on Meson Production, Properties and Interaction, May 31 - Jun 5, 2012 Krakow, Poland<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2016 GlueX: Photoproduction of Hybrid Mesons], Elton Smith<br />
*[http://cipanp2012.triumf.ca/ CIPANP 2012] Eleventh Conference on the Intersections of Particle and Nuclear Physics, May 29 - Jun 3, 2012, St. Petersburg, Florida<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2022 GlueX: Neutron Radiation Hardness of SiPM and Its Applications in Jefferson Lab Hall-D], Yi Qiang<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2675 Development of Level-1 Triggers for Experiments at Jefferson Lab], Alexander Somov<br />
<br />
=== Talks in 2011 ===<br />
*[http://web.mit.edu/panic11/ PANIC 11] The 19th Particles and Nuclei International Conference, in Cambridge, Massachusetts at the Massachusetts Institute of Technology (MIT)<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1792 GlueX: Detector Construction and Event Simulations], Naomi Jarvis.<br />
* [http://www.hadron2011.de/ Hadron 2011], Munich, Germany, June 13-17, 2011<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1780 Search for Gluonic Excitations in Hadrons with GlueX], I. Senderovich<br />
* [http://www.cap.ca/en/congress/2011 CAP Congress 2011], Memorial University of Newfoundland (St. John's, Newfoundland), June 13-17, 2011<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2123 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou<br />
* [http://www.aps.org/meetings/april/info/index.cfm APS Physics - April Meeting], Hyatt Regency Garden Grove, Anaheim, CA, April 30 - May 1, 2011.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1729 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1731 The Meson Spectrum from Lattice QCD], J. Dudek.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1734 The Experimental Spectrum of Hadrons], M. Shepherd.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1767 Overview of GlueX Offline Computing], M. Ito.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1727 Charged Particle Tracking for the GlueX Detector], S. Taylor.<br />
* [https://sites.google.com/site/ghpworkshop/home The 4th Workshop of the APS Topical Group on Hadronic Physics], Hyatt Regency Garden Grove, Anaheim, CA, April 27-29, 2011.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1730 The Meson Spectrum from Lattice QCD], J. Dudek.<br />
* [http://www1.jlab.org/ul/calendar/calendar_date.cfm?date=23&month=2&year=2011 Workshop on Excited Hadronic States and the Deconfinement Transition], Jefferson Lab, Newport News, VA, February 23-25, 2011.<br />
** [http://www.curtismeyer.com/talks/C_Meyer_Hadron_Spectroscopy.pptx Spectroscopy: experimental status and prospects], Curtis A. Meyer.<br />
* [http://www.sfu.ca/~caa12/WNPPC11/ WNPPC 2011], Winter Nuclear and Particle Physics Conference, 18-20 Feb 2011, Banff, Canada<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1711 The GlueX Barrel Calorimeter], Zisis Papandreou<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1712 Large Area Multi-Pixel Photon Detectors for the GlueX Barrel Calorimeter], Mehrnoosh Tahani<br />
* [http://www.gsi.de/forschung/kp/had2/meeting/Hirschegg_2011.html Hirschegg 2011], The Structure and Dynamics of Hadrons, Hirschegg, Austria, January 16-22, 2011<br />
** [http://argus.phys.uregina.ca/gluex/DocDB/0017/001706/001/hirschegg_gluex.pdf The GlueX Experiment (and its Context)], Ryan Mitchell<br />
<br />
=== Talks in 2010 ===<br />
* [http://www.lanl.gov/dnp DNP 2010], Santa Fe, NM, Nov 2 - 6.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1639 Level-1 Trigger of the GlueX Experiment], Alexander Somov<br />
* Jefferson Lab PAC36, Jefferson Lab, Newport News, VA, August, 2010.<br />
** [http://www.curtismeyer.com/talks/GlueX_Pac_Presentation.pptx GlueX Presentation], Curtis A. Meyer.<br />
* [http://www.lnf.infn.it/public/ LNF Frascati], Frascati, Italy, June 30, 2010<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2124 The GlueX Experiment: construction is under way], Z. Papandreou<br />
* [http://www.cap.ca/en/congress/2010 CAP Congress 2010], University of Toronto (Toronto, Ontario), June 7-11, 2010<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2123 The GlueX Electromagnetic Barrel Calorimeter], Z. Papandreou<br />
* Jefferson Lab Users Group Meeting, Jefferson Lab, Newport News, VA, June 7-9, 2010. <br />
** [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=1544 GlueX/Hall-D Physics], Curtis A. Meyer.<br />
* [https://www.jlab.org/conferences/MENU10 Meson-Nucleon Physics and the Structure of the Nucleon (MENU 2010)], Williamsburg, VA, May 31, 2010.<br />
** [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1540 Physics Prospects with GlueX], Alexander Somov<br />
<br />
== Seminar Talks ==<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2432 GlueX Program at Hall-D], Pizza seminar at Jefferson Lab, Newport News, Feb 26, 2014, Yi Qiang<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2288 Physics Program at Jefferson Lab Hall-D], Seminar talk at Argonne National Lab, Chicago, August 2013, Yi Qiang<br />
* [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1957 The Search for Gluonic Excitations in Light Mesons with the GlueX Experiment], Seminar Talk at CPHT, Ecole Polytechnique, France, April 2012, Igor Senderovich<br />
<br />
== Colloquium Talks ==<br />
* [http://www.curtismeyer.com/talks/UTK_Sep_11_Colloquium.pptx The Jefferson Lab 12-GeV Upgrade and the GlueX Experiment], Colloquium at The University of Tennessee, Knoxville, September 2011, Curtis A. Meyer.<br />
<br />
* [http://www.curtismeyer.com/talks/ASU_Colloquium.pptx Quarks, QCD and Confinement: What we hope to learn at Jefferson Lab], Colloquium at ASU, February 2010, Curtis Meyer.<br />
<br />
* [http://www.curtismeyer.com/talks/StVincent2.pptx Gluonic Hadrons as a Probe of Confinement], Undergraduate Colloquium at Saint Vincent College, Latrobe PA, September 2009, Curtis Meyer.</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_October_28,_2015&diff=70911GlueX Offline Meeting, October 28, 20152015-10-28T17:24:23Z<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, October 28, 2015<br><br />
1:30 pm EDT<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## new Version Management System features: URL checking for Git, hash-specific check-outs, [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=2793 document update]<br />
## [http://www.jlab.org/Hall-D/software/HDSoftware_Documentation/ Doxygen] fixed<br />
## [https://mailman.jlab.org/pipermail/halld-offline/2015-October/002155.html New releases: sim-recon 1.6.0 and HDDS 3.4]<br />
## [https://halldweb.jlab.org/wiki-private Private wiki] getting there<br />
## [https://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_FAQ&diff=70904&oldid=70249 Recent additions to the Offline FAQ]<br />
# Review of [[GlueX Offline Meeting, October 14, 2015#Minutes|minutes from October 14]] (all)<br />
# [https://halldweb1.jlab.org/wiki/images/1/1b/2015-10-27-offline_monitoring.pdf Offline Monitoring] (Kei)<br />
# Geant4 Update (Richard, David)<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Sim1 Conditions|Future Commissioning Simulations]] (all)<br />
# [[Automatic Tests of GlueX Software|b1pi results review]]<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
#* comments on merge<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007.<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/ .</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=File:2015-10-27-offline_monitoring.pdf&diff=70910File:2015-10-27-offline monitoring.pdf2015-10-28T17:23:46Z<p>Kmoriya: Talk at offline meeting on Oct 28 2015 for offline monitoring by Kei</p>
<hr />
<div>Talk at offline meeting on Oct 28 2015 for offline monitoring by Kei</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70884Data Monitoring Procedures2015-10-26T14:38:22Z<p>Kmoriya: /* Starting the Launch and Submitting Jobs */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* In the paths below, 201Y-MM denotes a run period (for example, 2015-03) and verVV a launch version (for example, ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
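For example, to list the available tables once connected (a generic MySQL command, shown here for convenience):<br />
<pre><br />
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "show tables;"<br />
</pre><br />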
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings), jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from Mark Ito's jproj scripts<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, each package should be checked out; all of the necessary<br />
scripts are included.<br />
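A minimal checkout sketch, assuming the ~/halld layout used by the gxprojN accounts:<br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
</pre><br />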
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the latest versions of the software must be built.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
a setup file such as ~/setup_jlab-2015-03.csh should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>cd hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>cd sim-recon/src</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span> <pre>cd ~/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
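Once the builds finish, a quick sanity check that the launch will pick up fresh code and constants (a sketch, assuming the ~/builds layout described above):<br />
<pre><br />
cd ~/builds/hdds/hdds && git log -1 --oneline<br />
cd ~/builds/sim-recon/sim-recon && git log -1 --oneline<br />
ls -l ~/ccdb.sqlite<br />
</pre><br />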
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b>, with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows; hdswif also provides wrappers for most simple SWIF commands. <pre>swif list</pre> To create a workflow for offline monitoring, use <pre>hdswif.py create [workflow] -c [config file] </pre> As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes versions of the software easier to find than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables may be embedded in a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save it as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register running over only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow] -errorlimit none</pre> <span style="color:red">MAKE SURE THE ERRORLIMIT IS SET TO NONE OR THE WORKFLOW WILL BE STOPPED AFTER ANY JOB FAILS</span> or equivalently, using the hdswif wrapper (which has the errorlimit set by default), <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br> For this purpose, the hdswif run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
# The status of jobs can be checked on the terminal for Auger with <pre>jobstat -u gxprojN</pre> and for SWIF with <pre>swif list</pre> or, for more information, <pre>swif status [workflow] -summary</pre> Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
# For failed jobs, SWIF can resubmit jobs based on the problem. To resubmit failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added, e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
# For information on swif, use the "swif help" commands; for hdswif, see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML format and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, and for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, within the analysis directory of each launch. An example query combining the job tables is sketched below.<br />
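A hedged sketch of how the Job and _aux tables might be queried together (the _aux column names numEvents and pluginTime are hypothetical, as is the assumption that its job-id column is called jobId; check with "show columns" first):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select j.run, j.file, j.walltime, a.numEvents, a.pluginTime from <project_name>Job as j join <project_name>_aux as a on j.jobId = a.jobId order by j.run, j.file"<br />
</pre><br />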
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining steps are then the post-processing,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
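As an illustration, a test submission of five jobs for a single run might look like the following (the project name and run number here are examples only):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
jobstat -u gxproj1<br />
</pre><br />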
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Useful commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
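To restore, the dump file produced by backup_tables.sh can be fed back to mysql. A sketch, assuming a dump file named backup_2014_10_ver17.sql (the actual name depends on how backup_tables.sh labels its output):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />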
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will likely submit the same job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
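For reference, a manual invocation of exec.sh would look something like the following sketch (the project name and maximum file number are illustrative):<br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
./exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />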
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it then processes. It contains several configuration variables that must be correctly set, such as the locations of the input/output directories.<br />
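Since the shell script sets up its own environment, the cron entry can be a single line. A minimal sketch, assuming a 10-minute interval and a hypothetical log location (not the production settings):<br />
<syntaxhighlight><br />
*/10 * * * * /home/gluex/halld/monitoring/process/check_new_runs.csh >> /home/gluex/halld/monitoring/process/log/check_new_runs.log 2>&1<br />
</syntaxhighlight><br />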
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
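The per-run merging is conceptually equivalent to running ROOT's hadd by hand, as in this sketch (the file names are hypothetical):<br />
<syntaxhighlight><br />
hadd -f hd_root_003180_merged.root hd_root_003180_*.root<br />
</syntaxhighlight><br />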
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (example entries are sketched below):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
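For illustration, hypothetical entries for histograms_to_monitor (one histogram name or full ROOT path per line) might be:<br />
<syntaxhighlight><br />
/FCAL/fcal_occupancy<br />
bcal_num_events<br />
</syntaxhighlight><br />
and for macros_to_monitor:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/macros/FCAL_occupancy.C<br />
</syntaxhighlight><br />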
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and have created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/<br />
and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Add a new data version<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70871Data Monitoring Procedures2015-10-23T23:28:30Z<p>Kmoriya: /* Checking the Status and Resubmitting */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh.<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819; more study is needed to see if the straight-line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from Mark Ito's jproj scripts<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, check out each package; all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt at the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which is a soft link to /work/halld/home/gxprojN/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh or a similar file should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>cd hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>cd sim-recon/src</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span> <pre>cd ~/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows, and for most simple SWIF commands hdswif provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other config variables referenced within a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register running only over run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow]</pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br> For this purpose, hdswif's run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
# The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> on Auger and for SWIF with <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
# For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission of failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be resubmitted with more resources, use, e.g., <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added; e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs, since the number should be relatively small. <b>Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.</b><br />
# For information on swif, use the "swif help" commands; for hdswif, see the attached documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the tables' roles are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files output by each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
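For orientation, these tables can be inspected directly in mysql once a launch exists. Below is a minimal sketch; the project name is a placeholder, and the status column is one of the job status table columns listed later on this page:<br />
<pre><br />
# Connect as described in the "Handy mysql Instructions" section below<br />
mysql -hhallddb -ufarmer farming<br />
# List the tables belonging to one launch (placeholder project name)<br />
mysql> show tables like 'offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata%';<br />
# Tally job outcomes from the job status table<br />
mysql> select status, count(*) from offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob group by status;<br />
</pre><br />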
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, with files copied from the template directory and modified to reflect the run period, the user, the directory that it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory.<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
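For example, a cautious first test might submit only a handful of jobs for a single run (the run and job numbers here are purely illustrative):<br />
<pre><br />
# Submit at most 5 jobs, restricted to run 3180 (illustrative values)<br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5 3180<br />
</pre><br />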
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send all jobs in. What remains is then the post-processing<br />
(which, among other things, puts the results on the online webpage for the collaboration to view) and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Commonly used commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit jobs multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
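As an illustration, the installed crontab entry might look like the following; the 10-minute schedule and the argument values are assumptions, so check the actual cron_plugins file before installing it:<br />
<pre><br />
# Hypothetical cron_plugins entry: run exec.sh every 10 minutes with the<br />
# project name and maximum file number as arguments (illustrative values)<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 9<br />
</pre><br />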
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job, which is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script checks for new ROOT files and automatically processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
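Conceptually, the per-run combination step is what ROOT's hadd utility does; a minimal sketch with illustrative file names (the post-processing script may organize this differently):<br />
<syntaxhighlight><br />
# Merge the per-EVIO-file monitoring histograms for one run into a single ROOT file<br />
# (file names are illustrative; the cron script does this automatically)<br />
hadd -f hd_root_run003180.root hd_root_run003180_*.root<br />
</syntaxhighlight><br />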
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
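For example, entries in these two files might look like the following (shown together here for brevity; the histogram, path, and macro names are purely illustrative):<br />
<syntaxhighlight><br />
# histograms_to_monitor: one histogram name or full ROOT path per line<br />
bcal_num_events<br />
/fcal/fcal_occupancy<br />
<br />
# macros_to_monitor: one full path to a RootSpy macro .C file per line<br />
/home/gxproj1/halld/monitoring/macros/BCAL_occupancy.C<br />
</syntaxhighlight><br />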
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
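For instance, reusing the illustrative arguments from the example above (assuming the option is given before the positional arguments):<br />
<syntaxhighlight><br />
# Reprocess all monitoring ROOT files, not just newly identified ones<br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />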
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
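A quick way to scan these logs for trouble, as a sketch (the exact log layout may differ):<br />
<syntaxhighlight><br />
# List log files that mention errors (case-insensitive)<br />
grep -il error $HOME/halld/monitoring/process/log/*<br />
</syntaxhighlight><br />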
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>cd hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>cd sim-recon/src</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span> <pre>cd ~/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. As an example config file, see the input.config file in the folder (and update it). When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run -workflow [workflow]</pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Checking the Status and Resubmitting ===<br />
# The status of jobs can be checked on the terminal with <pre>jobstat -u gxprojN</pre> on Auger and for SWIF with <pre>swif list</pre> or for more information, <pre>swif status [workflow] -summary</pre> Also see the Auger [http://scicomp.jlab.org/scicomp/#/auger/jobs job website].<br />
# For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources, <pre>swif retry-jobs [workflow] -problems [problem name]</pre> can be used, and for jobs to be submitted with more resources, e.g., use <pre>swif modify-jobs -ram add 2gb -problems AUGER-OVER_RLIMIT</pre> hdswif has a wrapper for both of these: <pre>hdswif.py resubmit [workflow] [problem]</pre> In this case [problem] can be one of <b>SYSTEM, TIMEOUT, RLIMIT</b>. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified by default with 2 additional hours or GB of RAM. If one more number is added as an option, then that many hours or GB of RAM will be added., e.g., <pre>hdswif.py resubmit [workflow] TIMEOUT 5</pre> will add 5 hours of processing time.<br />
# For information on swif, use the "swif help" commands and for hdswif see the attached documentaion in https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf<br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
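Putting these commands together, a typical post-launch summary session might look like the following (workflow name as in the examples above):<br />
<pre><br />
swif status offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -summary -runs<br />
hdswif.py summary offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
# writes swif_output_offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata.xml<br />
# and creates the HTML page and figures summarizing the launch<br />
</pre><br />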
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations between different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
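Once the Job table is filled, quick sanity checks can be run directly in MySQL. For example, a minimal sketch, assuming the table carries the same result column as the jproj job-status tables shown further below:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select result, count(*) from offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob group by result"<br />
</pre><br />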
<br />
== Hall D Job Management System ==<br />
<br />
This section gives instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: the system can be used for the weekly monitoring jobs as well as for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
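As a quick orientation, all three tables live in the "farming" database on hallddb and can be listed with a pattern match on the project name (a minimal check, assuming the naming convention above):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "show tables like '<project_name>%'"<br />
</pre><br />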
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created; files from the template directory will be copied into it and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can submit all of the jobs. The remaining steps are then the post-processing of the monitoring output,<br />
which will (among other things) put the results on the web pages for the collaboration to view, and<br />
the analysis of the launch statistics.<br />
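A minimal test cycle along these lines (the job count here is arbitrary) would be:<br />
<pre><br />
jproj.pl <project name> update      # register all matching files on /mss<br />
jproj.pl <project name> submit 5    # submit 5 test jobs only<br />
./status.sh                         # check that the test jobs succeed<br />
jproj.pl <project name> submit      # then submit everything remaining<br />
</pre><br />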
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run beforehand, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to write out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage, backing up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
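Internally this amounts to a mysqldump of the three launch tables; a hand-rolled sketch (the output file name here is illustrative) would be roughly:<br />
<pre><br />
mysqldump -hhallddb -ufarmer farming offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata offline_monitoring_RunPeriod2014_10_ver17_hd_rawdataJob offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata_aux > backup_2014_10_ver17.sql<br />
# executing the dump later will DROP and recreate the tables:<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />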
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else jobs will likely be submitted multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the crontab file that will be installed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before installing it (an illustrative entry is sketched below). <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
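For illustration only, the installed crontab entry would have this general shape (the schedule, maximum file number, and log path here are hypothetical):<br />
<pre><br />
# run exec.sh every 10 minutes with arguments: [project name] [max file number]<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh <project_name> 9 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1<br />
</pre><br />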
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is installed, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
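A quick check that the environment is loaded correctly is to import the MySQL module from the python now on the PATH (assuming the scripts use the common MySQLdb binding; adjust the module name if they use a different one):<br />
<pre><br />
python -c "import MySQLdb; print 'MySQL module found'"<br />
</pre><br />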
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be set correctly, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
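Entries in these files are one per line; a hypothetical excerpt (the histogram and macro names below are made up for illustration) might look like:<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path<br />
fcalOccupancy<br />
/FCAL/fcalOccupancy<br />
# macros_to_monitor: full path to the RootSpy macro .C file<br />
/home/gxproj1/halld/monitoring/macros/FCAL_occupancy.C<br />
</pre><br />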
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you have created a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
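For instance, to reprocess everything for the ver02 example above regardless of what has already been processed (the placement of the option is assumed here):<br />
<pre><br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02 --force<br />
</pre><br />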
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with <pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute <pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70848Data Monitoring Procedures2015-10-22T19:34:17Z<p>Kmoriya: /* Preparing the software for the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as <b>sqlite:////home/gxproj5/ccdb.sqlite</b> through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to this directory and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <br><span style="color:red">NOTE: SQLITE FILES DO NOT WORK ON THE NEW /work DISK INSTALLED IN OCTOBER 2015 </span> <pre>cd ~/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, there are two tables needed, and will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of tables and their roles are the same as from the jproj only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
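As a concrete illustration, a cron_plugins file might contain a single entry of the following form (the 10-minute interval and the argument values are assumptions for illustration):<br />
<pre><br />
# Run exec.sh every 10 minutes: arguments are [project name] [max file number]<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 004<br />
</pre><br />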
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
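A quick way to check that the environment was loaded correctly is to verify that python can import the MySQL module (assuming here that the environment provides the MySQLdb module):<br />
<pre><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
python -c "import MySQLdb; print MySQLdb.__version__"<br />
</pre><br />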
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. Connecting to the monitoring database on the JLab CUE requires modules that are only included in the local installation of python >= 2.7. The shell script is appropriate for use in a cron job; the cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
Example configuration parameters:<br />
<syntaxhighlight><br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
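Putting the pieces together, a complete manual post-processing pass might look like the following (the paths, date, and version numbers are illustrative, taken from the examples above):<br />
<pre><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
# Register the new data version and note the version number it returns.<br />
./register_new_version.py add versions/vers_RunPeriod-2014-10_pass1.txt<br />
# Process the new ROOT files into the database and webpage plots.<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</pre><br />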
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data was created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
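Once registered, a version record can be looked up directly in the monitoring database; a hypothetical query (the table name here is an assumption, not taken from the actual schema) would look like:<br />
<pre><br />
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "select revision,software_version,production_time from version where run_period='RunPeriod-2014-10'"<br />
</pre><br />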
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70784Data Monitoring Procedures2015-10-19T14:29:52Z<p>Kmoriya: /* Starting the Launch and Submitting Jobs */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* In the paths below, 201Y-MM stands for a run period (for example 2015-03) and verVV for a launch version (for example ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results with the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, which allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, and the monitoring plugins, as well as an sqlite file, will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819; more study is needed to see if the straight-line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from Mark Ito's jproj scripts<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring, check out each package; all necessary scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be rebuilt at the latest versions.<br />
For the gxprojN user accounts, all software builds are contained in the directory ~/builds<br />
(a soft link to /work/halld/home/gxprojN/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh (or a similar file) should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds, and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. A quick environment check is sketched after this list. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds. Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
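As referenced in the sqlite step above, a minimal check that the new environment actually points at the fresh sqlite file (variable names as set in the setup script):<br />
<pre><br />
source ~/setup_jlab-2015-03.csh<br />
echo $JANA_CALIB_URL<br />
echo $CCDB_CONNECTION<br />
ls -l ~/builds/ccdb.sqlite<br />
</pre><br />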
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b>, with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows (for most simple SWIF commands, hdswif provides a wrapper as well). <pre>swif list</pre> For creation of workflows for offline monitoring, the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as, for example,<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#* The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push origin offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.<br />
#* To use the git-tagged software versions do for example <pre>cd $HALLD_HOME</pre> <pre>git checkout offmon-2015_03-ver15</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, the config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables can be referenced inside a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run [workflow]</pre><br><br />
<b>It is recommended that a few jobs be tested first to make sure that everything is working, rather than having thousands of jobs fail.</b><br> For this purpose, hdswif's run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
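Putting the steps together, a complete launch sequence might look like the following (workflow name illustrative):<br />
<pre><br />
hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
hdswif.py add offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10<br />
# after checking that the test jobs succeed:<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
</pre><br />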
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob;</pre> A fuller example query is sketched below.<br />
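As referenced above, an example query pulling per-job information out of the Job table; the column names follow those used in the status query later in this document, and the table name is illustrative:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select run,file,jobId,hostname,status,walltime,cput,result from offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob limit 10;"<br />
</pre><br />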
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
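To see exactly which columns a given launch recorded, the table layouts can be inspected directly (placeholder name as above):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "describe <project_name>_aux;"<br />
</pre><br />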
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern given in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
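For example, to submit five test jobs from a single run (numbers illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5 3180<br />
</pre><br />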
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining tasks are then the post-processing,<br />
which will (among other things) put the results on the monitoring webpages for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
Some commonly used commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
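To restore from a backup, the dump file is simply replayed through mysql; since mysqldump output contains DROP TABLE statements, this overwrites any existing tables of the same names (file name illustrative):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />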
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else jobs will likely be submitted multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
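A sketch of what the cron_plugins file might contain, assuming the 10-minute cadence mentioned earlier and illustrative arguments:<br />
<pre><br />
# run exec.sh every 10 minutes: [project name] [max file number]<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 9 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1<br />
</pre><br />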
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job; the cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be set correctly, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
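Conceptually, the per-run combination step is equivalent to merging with ROOT's hadd (file names here are illustrative, not the script's actual bookkeeping):<br />
<pre><br />
hadd -f run003180_combined.root hd_monitoring_003180_*.root<br />
</pre><br />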
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
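For example, histograms_to_monitor might contain lines like the following (names hypothetical, one entry per line):<br />
<pre><br />
FCALclusterEnergy<br />
/bcal/bcal_occupancy<br />
</pre><br />
and macros_to_monitor full paths like:<br />
<pre><br />
/home/gxproj1/halld/monitoring/macros/CDC_occupancy.C<br />
</pre><br />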
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new data version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
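For instance, to reprocess an entire launch regardless of what has already been recorded (arguments as in the example above; option placement assumed):<br />
<pre><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</pre><br />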
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds,and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
#The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find versions of the software than a SHA-1 hash. hdswif will ask if you would like to create a tag, and execute the following sequence: <pre>git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"</pre> <pre>git push offmon-201Y_MM-verVV</pre> This will only be invoked when the user name is gxprojN, and for the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts while it will be the current directory for other users.<br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches (see the example query after this list). For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob;</pre><br />
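As an illustration of the cross-launch comparisons these tables make possible, the hypothetical query below lists run/file combinations whose job results differ between two launches. The table and column names follow the conventions shown elsewhere on this page, but the query itself is only a sketch, not part of the standard scripts:<br />
<pre><br />
# Hypothetical comparison of job results between two launches (ver11 vs ver15)<br />
mysql -hhallddb -ufarmer farming -e "select a.run, a.file, a.result as ver11_result, b.result as ver15_result from offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob a join offline_monitoring_RunPeriod2015_03_ver15_hd_rawdataJob b on a.run = b.run and a.file = b.file where a.result != b.result"<br />
</pre><br />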
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs as well as for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
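For example, the per-job quantities listed above could be inspected with a query along the following lines. This is a sketch only: the column names shown are assumptions for illustration, since the actual schema is defined by the analysis scripts.<br />
<pre><br />
# Hypothetical query of the job metrics table; column names are assumptions<br />
mysql -hhallddb -ufarmer farming -e "select jobId, events_processed, copy_time, plugin_time from <project_name>_aux limit 10"<br />
</pre><br />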
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
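As a minimal sketch of this naming convention (for illustration only, not the actual create_project.sh code), the run period and version can be recovered from a project name of the assumed form as follows:<br />
<pre><br />
# Illustration only: parse run period and version from the assumed name format<br />
name=offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
runperiod=`echo $name | sed 's/.*RunPeriod\(20[0-9][0-9]_[0-9][0-9]\).*/\1/'`   # gives 2015_03<br />
version=`echo $name | sed 's/.*_ver\([0-9]*\)_.*/\1/'`                          # gives 15<br />
</pre><br />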
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created; its files are copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options, all files that are registered but not yet submitted will be submitted.<br />
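For example, to submit a first batch of 10 test jobs from a single run (the run number here is purely illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 10 3180<br />
</pre><br />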
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send in all of the jobs. The remaining steps are then the post-processing<br />
of the monitoring output, which will (among other things) put the results on the web pages for the<br />
collaboration to view, and the analysis of the launch statistics.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were submitted, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed together by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
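Conceptually, status.sh is a thin wrapper around the two commands above; a minimal sketch (the actual script may differ in detail) would be:<br />
<pre><br />
#!/bin/sh<br />
# Sketch of status.sh: update the job status table, then print selected columns<br />
fill_in_job_details.pl <project_name><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />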
<br />
=== Handy mysql Instructions ===<br />
<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop existing tables before recreating them, caution is advised.<br />
Example usage to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
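Internally this amounts to a mysqldump of the three tables for that launch. The sketch below (the output file name is an assumption) also shows how such a backup would be restored, which is where the caution above applies:<br />
<pre><br />
# Sketch of the underlying backup; the output file name is illustrative<br />
mysqldump -hhallddb -ufarmer farming offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata offline_monitoring_RunPeriod2014_10_ver17_hd_rawdataJob offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata_aux > backup_2014_10_ver17.sql<br />
# Restoring drops the existing tables before recreating them<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />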
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else jobs will likely be submitted multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the crontab specification that will be installed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running (a hypothetical entry is sketched after this list). <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
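For reference, the cron_plugins file is an ordinary crontab specification. A hypothetical entry that runs exec.sh every 10 minutes for a given project, ignoring file numbers above 009, might look like the following (the schedule and arguments are illustrative, not the actual file contents):<br />
<pre><br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 009<br />
</pre><br />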
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. The python script automatically checks for new ROOT files, which it then processes. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
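The combination step is conceptually a ROOT hadd over the per-file histogram files for a run; a sketch with hypothetical file names:<br />
<pre><br />
# Sketch only; the actual file names and paths are set by the processing scripts<br />
hadd -f run003180.root hd_root_run003180_*.root<br />
</pre><br />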
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
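For example (all names below are hypothetical), histograms_to_monitor might contain lines such as<br />
<pre><br />
cdc_occupancy<br />
/FCAL/fcal_num_events<br />
</pre><br />
and macros_to_monitor lines such as<br />
<pre><br />
/home/gxproj1/halld/monitoring/process/macros/FCAL_occupancy.C<br />
</pre><br />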
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been processed previously.<br />
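For instance, a full reprocessing of the example above might look like the following (the flag placement is an assumption):<br />
<pre><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</pre><br />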
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here] for how each account is used).<br />
As of October 2015, the following are used:<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run the offline monitoring each package should be checked out and all necessary<br />
scripts are included.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxprojN/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: FOR BUILDING SOFTWARE IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT EACH TIME TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds,and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> <pre>mv ccdb.sqlite ../</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> For creation of workflows for offline monitoring the command <pre>hdswif.py create [workflow] -c [config file] </pre> should be used. When a config file is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored as for example<br />
#* /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_2015_03_ver15.conf<br />
#* /group/halld/data_monitoring/run_conditions/soft_comm_2015_03_ver15.xml<br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
The config file contains configuration parameters for each of the jobs.<br><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
<br />
Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br><br />
<b>It is recommended that some jobs be tested to make sure that everything is working rather than fail thousands of jobs.</b><br> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre> For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, there are two tables needed, and will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of tables and their roles are the same as from the jproj only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machine does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which contains the location of input/output directories, etc...<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its the full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
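<br />
For example, histograms_to_monitor takes one entry per line, either a bare histogram name or a full ROOT path (the names below are hypothetical):<br />
<syntaxhighlight><br />
cdc_occupancy<br />
/highlevel/fcal_digTime<br />
</syntaxhighlight><br />
while macros_to_monitor takes one full macro path per line, e.g. (again hypothetical):<br />
<syntaxhighlight><br />
/home/gxprojN/halld/monitoring/process/macros/FCAL_occupancy.C<br />
</syntaxhighlight><br />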
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
# Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, use the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of particular interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been processed before.<br />
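<br />
For example, to force a full reprocessing of the launch from the example above (assuming the option can simply be prepended to the positional arguments):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />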
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations (a worked example follows this list):<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
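<br />
For example, with run period 2015-03 and launch version ver15, the first copy would read:<br />
<syntaxhighlight><br />
cp -a /volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15/REST /work/halld/data_monitoring/RunPeriod-2015-03/REST/ver15<br />
</syntaxhighlight><br />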
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70741Data Monitoring Procedures2015-10-15T02:02:51Z<p>Kmoriya: /* Starting the Launch and Submitting Jobs */</p>
<hr />
<div>__TOC__<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we saw in the online monitoring, and also to update the results with the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings), jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, and the monitoring plugins, as well as an sqlite file, will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh.<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819; more study is needed to see whether the straight-line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt at the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which is a soft link to, e.g., /work/halld/home/gxproj5/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh or a similar file should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors; see the sketch after this list for moving it into place. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds. Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
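<br />
Since the setup script expects the sqlite file at ~/builds/ccdb.sqlite, the freshly generated file then has to be moved into place; a minimal sketch (assuming the file was created in ~/builds/tmp as above):<br />
<pre><br />
cd ~/builds/tmp<br />
mv ccdb.sqlite ~/builds/ccdb.sqlite<br />
</pre><br />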
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands, hdswif provides a wrapper. <pre>swif list</pre> <pre>swif create [workflow]</pre> Or equivalently, <pre>hdswif.py create [workflow] </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, the config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 6<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 8<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 15<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables can be referenced inside a value<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br /><br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/, where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified, for example, with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register only run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run [workflow]</pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run [workflow]</pre><br /><br />
<b>It is recommended that a few jobs be tested to make sure that everything is working, rather than failing thousands of jobs.</b><br /> For this purpose, hdswif takes an additional parameter to run, which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre> A consolidated example session is sketched below.<br />
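<br />
Putting these steps together, a typical launch session might look like the following (the workflow name is illustrative):<br />
<pre><br />
cd ~/halld/hdswif<br />
hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
hdswif.py add offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
# submit a small test batch first<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10<br />
# once the test jobs look good, submit everything else<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
</pre><br />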
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite it.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre><br />
For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, two tables are needed, named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# The naming scheme of these tables and their roles are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files of each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub. This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions.<br />
To create a project, do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern given in <project name>.jproj. If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send all jobs in. The remaining tasks are then the post-processing,<br />
which will (among other things) put the results on the webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
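<br />
For instance, a small test submission restricted to a single run might look like the following (project name, job count, and run number are illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata submit 5 3180<br />
</pre><br />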
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if any test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, delete the tables from the database and recreate new, empty ones by running: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
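<br />
To restore from such a backup, feed the dump back to mysql; a minimal sketch, with a hypothetical dump file name:<br />
<pre><br />
# note: executing the dump drops and recreates the tables it contains<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />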
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit the same job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt,<br />
so clear this file to run over those runs again, or set the parameters<br />
MINRUN and MAXRUN, which set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machine does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which contains the location of input/output directories, etc...<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its the full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the ouptut date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory corresponding to the run period and revision corresponding to the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
Then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are current run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to setup the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducability and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The data at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70740Data Monitoring Procedures2015-10-15T01:40:40Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt from the latest versions.<br />
For the gxprojN user accounts, all software builds are contained in the directory ~/builds<br />
(a soft link to, e.g., /work/halld/home/gxproj5/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh (or a similar file) should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors; the sketch after this list shows moving the finished file into place. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to extract information about the library locations automatically.<br />
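Once the dump completes, the new sqlite file must be moved to the location referenced by JANA_CALIB_URL and CCDB_CONNECTION. A minimal sketch, assuming the file was created in ~/builds/tmp as in the step above:<br />
<pre><br />
# Replace the old sqlite file with the freshly dumped one<br />
mv ~/builds/tmp/ccdb.sqlite ~/builds/ccdb.sqlite<br />
# Sanity check: the environment should point at this file<br />
echo $CCDB_CONNECTION<br />
</pre><br />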
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands hdswif provides a wrapper. <pre>swif list</pre> <pre>swif create [workflow]</pre> Or equivalently, <pre>hdswif.py create [workflow] </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, the config file (-c), and run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables may be substituted in via [VARIABLE]<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/, where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified, for example, with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to register running only over run 3180, files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run [workflow]</pre><br />
<b>It is recommended to test a few jobs first to make sure that everything is working, rather than failing thousands of jobs.</b> For this purpose, hdswif's run command takes an additional parameter that limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all remaining jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
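Putting the above together, a typical launch session might look like the following. This is only a sketch; the workflow name and config file name are examples:<br />
<pre><br />
cd ~/halld/hdswif<br />
# Create the workflow<br />
hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
# Register all rawdata files specified by the config<br />
hdswif.py add offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
# Submit 10 test jobs first<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10<br />
# Once the test jobs look good, submit everything else<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
</pre><br />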
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite it.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the launch statistics, it is convenient to switch to the jproj system.<br />
<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj . Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre><br />
For the gxprojN accounts used for offline monitoring, the directory should be ~/halld/jproj .<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations across different launches. For the SWIF launches, two tables are needed; they will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme and the roles of the tables are the same as for the jproj-only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the stdout files output by each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre> Note that this script uses the XML output from hdswif summary and inserts the contents into the MySQL table, so the XML output file must exist.<br />
# To check the contents of this MySQL table, do <pre>mysql -hhallddb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
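For a quick look at individual job results, the table can also be queried directly. A sketch, assuming the Job table uses the same columns as the jproj-era tables documented in the [[#Project_Management | Project Management]] section below:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select run,file,status,walltime,result from offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob limit 10"<br />
</pre><br />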
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
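For a concrete project name, the presence of these three tables can be checked directly in mysql. A sketch; the project name is an example:<br />
<pre><br />
# Lists <project_name>, <project_name>Job, and <project_name>_aux if they exist<br />
mysql -hhallddb -ufarmer farming -e "show tables like 'offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata%'"<br />
</pre><br />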
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the form given in <project name>.jproj . If you want to register only a subset of such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options, all files that are registered and have not yet been submitted will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send in all jobs; an example sequence is sketched below. The remaining steps are then the monitoring,<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
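A cautious submission sequence might look like the following sketch; the project name, run number, and job count are examples only:<br />
<pre><br />
# Register all matching files in the job management table<br />
jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata update<br />
# Submit 5 test jobs for a single run<br />
jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata submit 5 1234<br />
# After verifying the test jobs, submit all remaining files<br />
jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata submit<br />
</pre><br />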
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the tables if they already exist, caution is advised.<br />
Example usage to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
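To restore from such a backup, replay the dump file through the mysql client. A sketch; the dump file name is an example, and note again that this drops any existing tables of the same names:<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />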
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project; otherwise you will likely submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the crontab file that will be installed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name and the maximum file number for each run. These fields should be updated in the cron_plugins file before running; see the sketch after this list. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
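For reference, the entry inside cron_plugins might look something like the following sketch; the 10-minute interval, project name, and maximum file number are examples to be replaced:<br />
<pre><br />
# Run exec.sh every 10 minutes; arguments are the project name and the maximum file number<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata 9<br />
</pre><br />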
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules needed for these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job; the cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, such as the locations of the input/output directories.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or from multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (example entries are sketched below):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
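As an illustration, both files contain one entry per line. The entries below are hypothetical examples, not actual histogram or macro names:<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path, one per line<br />
hFCALEnergy<br />
/CDC/hCDCHitTime<br />
</pre><br />
<pre><br />
# macros_to_monitor: full path to each RootSpy macro .C file<br />
/home/gxproj1/halld/monitoring/process/macros/cdc_occupancy.C<br />
</pre><br />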
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
Example configuration parameters:<br />
<syntaxhighlight><br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
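A quick way to inspect recent post-processing activity; the log file name below is hypothetical:<br />
<pre><br />
cd $HOME/halld/monitoring/process/log<br />
# Show the most recently modified log files<br />
ls -lt | head<br />
# Scan a specific log for errors<br />
grep -i error check_monitoring_data_2015-01-09.log<br />
</pre><br />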
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring and "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70739Data Monitoring Procedures2015-10-15T01:38:35Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxproj5/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds,and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, or most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> <pre>swif create [workflow]</pre> Or equivalently, <pre>hdswif.py create [workflow] </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br />
<b>It is recommended that some jobs be tested over to make sure that everything is working rather than fail thousands of jobs.</b> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
#* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
#* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
<br />
# The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre><br />
For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
# The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
# Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br><br><br />
<span style="color:red">All of the analysis commands including arguments are contained in run_analysis.sh, but it is <b>strongly</b> recommended that all commands are run manually to check for errors.</span><br><br><br />
# The first thing to do is to make the html output from hdswif public. Copy the html file and related figures that were created from hdswif to the appropriate space within /group/halld/www/halldweb/html/data_monitoring/ : <pre>python publish_offmon_results.py [run period] [version]</pre> Note that the command with appropriate substitutions for arguments can be found within run_analysis.sh (same for all commands below).<br />
# Next, we need to create a few MySQL tables for the current launch. The MySQL tables are useful for comparing run/file combinations for different launches. For the SWIF launches, there are two tables needed, and will be named<br />
#* [workflow]Job<br />
#* [workflow]_aux<br />
# This naming scheme of tables and their roles are the same as from the jproj only launches. The [workflow]Job table will contain information gathered from SWIF about each job (which node it went to, start time of each stage, memory usage, etc.). The [workflow]_aux table will contain information gathered from the output stdout files from each job. First, create the Job table using <pre>python create_jproj_job_table.py [run period] [version]</pre><br />
# To check the contents of this MySQL table, do <pre>mysql -hhalldb -ufarmer farming</pre> <pre>mysql> describe offline_monitoring_RunPeriod2015_03_ver11_hd_rawdataJob;</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
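For example, to register only files with file number 000 for a test (the project name and file number are illustrative):<br />
<pre>jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata update 000</pre><br />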
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, delete the tables from the database and recreate new, empty ones by running: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Commonly used mysql commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
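As a worked example, the following query summarizes how many jobs finished with each result code, using the result column of the job status table described above:<br />
<pre>mysql -hhallddb -ufarmer farming -e "select result,count(*) from <project_name>Job group by result"</pre><br />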
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to produce a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage, to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
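For reference, a minimal sketch of what the script does internally, assuming the three tables follow the <project_name>, <project_name>Job, and <project_name>_aux naming scheme (the actual invocation inside backup_tables.sh may differ):<br />
<pre>mysqldump -hhallddb -ufarmer farming <project_name> <project_name>Job <project_name>_aux > backup.sql</pre><br />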
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else jobs will likely be submitted multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file contains the crontab entry that will be installed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
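For reference, the installed crontab entry might look like the following (the schedule, project name, and log redirection are illustrative; consult the actual cron_plugins file):<br />
<pre>*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 014 >> /u/home/gxproj1/halld/monitoring/newruns/cron.log 2>&1</pre><br />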
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job, do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
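A quick way to verify that the MySQL bindings are available after sourcing the environment (the MySQLdb module name is an assumption; if the import fails, the environment is not set up correctly):<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
python -c 'import MySQLdb'<br />
</syntaxhighlight><br />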
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate for use in a cron job. The cron job is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cron job that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
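As a hypothetical illustration (these are not actual histogram names), histograms_to_monitor would contain one entry per line, e.g.:<br />
<pre><br />
fcal_occupancy<br />
/bcal/bcal_energy<br />
</pre><br />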
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as described below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
# Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below, then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to increase the verbosity of the output.<br />
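For example, one quick way to scan the logs for failures (the exact log file naming is an assumption):<br />
<pre>grep -il error $HOME/halld/monitoring/process/log/*</pre><br />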
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70738Data Monitoring Procedures2015-10-15T01:02:09Z<p>Kmoriya: /* Post-analysis of statistics of the launch */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxproj5/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds,and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, or most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> <pre>swif create [workflow]</pre> Or equivalently, <pre>hdswif.py create [workflow] </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run </pre> or equivalently, using the hdswif wrapper, <pre>hdsswif.py run [workflow]</pre><br />
<b>It is recommended that some jobs be tested over to make sure that everything is working rather than fail thousands of jobs.</b> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output in XML output and creates an HTML webpage showing results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# At this stage the html output and figure files are created and ready to be put online. For this step and other steps involving analysis of the statistics of the launch results, it is convenient to change to the jproj system.<br />
<br />
The jproj scripts for offline monitoring are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj Do <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj </pre><br />
For the gxprojN accounts used for offline monitoring the directory should be ~/halld/jproj<br />
<br />
The jproj directory contains two subdirectories, scripts and projects. The scripts directory contains useful scripts for processing the jobs registered in the jproj system, and each of the offline monitoring launches will be handled in the projects directory.<br />
<br />
Go to the projects directory<pre>cd ~/halld/jproj/projects</pre> and use the script create_project.sh to create a new directory that contains the processing scripts for the current launch <pre>./create_project.sh [workflow]</pre> This should create a directory such as offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata (same as the workflow name). The script uses the template files in the directory templates and by substitution creates script files for the current launch. Now go to the newly created analysis directory: <pre>cd [workflow]/analysis</pre><br />
which for the gxprojN accounts should have the full path /home/gxprojN/halld/jproj/projects/[workflow]/analysis.<br />
<br />
Once the html output is created, copy the html file and related figures: <pre>cp -r summary_swif_output_[workflow].html figures/</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machine does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which contains the location of input/output directories, etc...<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its the full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the ouptut date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory corresponding to the run period and revision corresponding to the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
Then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are current run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to setup the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducability and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The data at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70735Data Monitoring Procedures2015-10-14T20:33:53Z<p>Kmoriya: /* Offline Monitoring: Running Over Archived Data */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
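For example, a raw data file can be pre-staged from tape to the cache disk with the jcache utility (a sketch; the run period and file name are placeholders, and the exact jcache invocation should be checked on the ifarm):<br />
<pre><br />
jcache get /mss/halld/RunPeriod-2015-03/rawdata/Run003180/hd_rawdata_003180_000.evio<br />
</pre><br />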
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results using the latest calibrations and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
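For example, both packages can be checked out side by side into the account's working area:<br />
<pre><br />
cd ~/halld<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
</pre><br />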
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(each of which is a soft link, e.g. to /work/halld/home/gxproj5/builds). When logging into these accounts<br />
the setup file ~/setup_jlab-2015-03.csh (or a similar file) should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds, and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors (see the sketch after this list for moving it into place). Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
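As a sketch of the final step above (assuming the setup script points at ~/builds/ccdb.sqlite, as quoted in step 3), the freshly built file can then be moved into place:<br />
<pre><br />
# Move the new sqlite file out of the temporary directory to where<br />
# JANA_CALIB_URL and CCDB_CONNECTION expect it (an assumed final step)<br />
mv ~/builds/tmp/ccdb.sqlite ~/builds/ccdb.sqlite<br />
</pre><br />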
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands, hdswif provides a wrapper. <pre>swif list</pre> <pre>swif create [workflow]</pre> Or equivalently, <pre>hdswif.py create [workflow]</pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add [workflow] -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run [workflow]</pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run [workflow]</pre><br />
<b>It is recommended that a few jobs be tested first to make sure that everything is working, rather than failing thousands of jobs.</b> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run [workflow] 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run [workflow]</pre><br />
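Putting these steps together, a condensed launch sequence looks like the following (a sketch; the workflow name is a hypothetical example):<br />
<pre><br />
hdswif.py create offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
hdswif.py add offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata -c input.config<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 10<br />
# after verifying the test jobs succeed:<br />
hdswif.py run offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
</pre><br />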
<br />
=== Post-analysis of statistics of the launch ===<br />
<br />
# After jobs have been submitted, it will usually take a few days for all of the jobs to be processed.<br />
* Status of Auger: http://scicomp.jlab.org/scicomp/#/auger/jobs (see also links above)<br />
* Status of user jobs: <pre>jobstat -u [user name]</pre><br />
# The status and results of jobs are saved within the SWIF internal server, and are available via the command <pre>swif status [workflow] -summary -runs</pre> where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this output as XML and creates an HTML webpage showing the results of the launch. To do this, do <pre>hdswif.py summary [workflow]</pre> This will create an XML file swif_output_[workflow].xml that contains all information from SWIF. If the file already exists, hdswif will ask whether to overwrite the existing file.<br />
# Once the html output is created, copy the html file and related figures: <pre>cp -r summary_swif_output_[workflow].html figures/</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: the system can be used for the weekly monitoring jobs, and for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job; the extraction is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub. This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the form given in <project name>.jproj. If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
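For example, to submit a handful of test jobs for a single run before releasing the full launch (the project name, job count, and run number below are hypothetical):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata submit 5 3180<br />
</pre><br />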
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send in all jobs. The remaining tasks are then the post-processing<br />
of the monitoring output, which will (among other things) put the results on the webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
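As a quick health check, the job outcomes can also be tallied directly with a grouped query (a sketch, using the farming database tables described above):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select result, count(*) from <project_name>Job group by result"<br />
</pre><br />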
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
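To restore tables from such a backup, the dump file can be fed back to mysql (a sketch; the dump file name is a placeholder, and note again that any existing tables of the same name will be dropped):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_tables_2014_10_ver17.sql<br />
</pre><br />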
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
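For reference, a crontab file such as cron_plugins is a plain table of schedule fields and a command. A minimal sketch that runs exec.sh every 10 minutes might look like the following (the project name and maximum file number are hypothetical; consult the actual file in the repository):<br />
<pre><br />
# min hour day month weekday command<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />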
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. The python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
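As an illustration, both files take one entry per line; the entries below are hypothetical, and the comment lines are shown only for annotation:<br />
<pre><br />
# histograms_to_monitor: histogram name or full ROOT path, one per line<br />
NumReconstructedTracks<br />
/Independent/Hist_Reconstruction/NumDCHits<br />
<br />
# macros_to_monitor: full path to each RootSpy macro .C file<br />
/home/gxproj1/halld/monitoring/process/macros/CDC_occupancy.C<br />
</pre><br />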
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use that version when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
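For example, to reprocess everything regardless of what has been seen before (a sketch, assuming the option is given before the positional arguments):<br />
<pre><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</pre><br />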
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
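For example, to look at the most recent post-processing log (the log file naming here is an assumption):<br />
<pre><br />
ls -lt $HOME/halld/monitoring/process/log | head<br />
tail -n 50 $HOME/halld/monitoring/process/log/<most_recent_log><br />
</pre><br />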
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70733Data Monitoring Procedures2015-10-14T20:22:06Z<p>Kmoriya: /* Starting the Launch and Submitting Jobs */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods 201Y-MM are of the form 2015-03; launch versions verVV are of the form ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results using the latest calibrations and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(each of which is a soft link, e.g. to /work/halld/home/gxproj5/builds). When logging into these accounts<br />
the setup file ~/setup_jlab-2015-03.csh (or a similar file) should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds, and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors. Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF, jobs are registered into workflows, so first create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, for most simple SWIF commands, hdswif provides a wrapper. <pre>swif list</pre> <pre>swif create offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata</pre> Or equivalently, <pre>hdswif.py create offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look like this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
The config file contains configuration parameters for each of the jobs.<br />
<span style="color:red">Note: Job configuration parameters can be set differently for jobs within the same workflow if necessary.</span><br />
Edit the config file and save as a new file if necessary. Once the configuration is set, jobs can be added via <pre>hdswif.py add offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata -c input.config</pre><br />
By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with<br />
<pre>hdswif.py add offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata -c input.config -r 3180 -f '00[0-4]'</pre><br />
to specify to register running only over run 3180 files 000 - 004 (Unix-style brackets and wildcards can be used).<br />
# Running the workflow: To run the workflow, simply use swif run: <pre>swif run offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata</pre> or equivalently, using the hdswif wrapper, <pre>hdswif.py run offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata</pre><br />
<b>It is recommended that a few jobs be tested first to make sure that everything is working, rather than failing thousands of jobs.</b> For this purpose, hdswif will take an additional parameter to run which limits the number of jobs to submit: <pre>hdswif.py run offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata 10</pre><br />
in which case only 10 jobs will be submitted. To submit all jobs after checking the results, do <pre>hdswif.py run offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata</pre><br />
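Once jobs have been submitted, overall progress can be checked directly with swif (a sketch, using the example workflow name above):<br />
<pre><br />
swif status offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata -summary<br />
</pre><br />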
<br />
== Hall D Job Management System ==<br />
<br />
This section details how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: the system can be used for the weekly monitoring jobs, and for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job; the extraction is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub. This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the form given in <project name>.jproj. If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send in all jobs. The remaining tasks are then the post-processing<br />
of the monitoring output, which will (among other things) put the results on the webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made: to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Some useful commands within the mysql client:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage, to back up all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
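The backup is a standard mysqldump; below is a minimal sketch of the equivalent commands, assuming the three-table naming convention described above (the actual script's options may differ):<br />
<pre><br />
# Dump the project tables to a file that recreates them when executed<br />
mysqldump -hhallddb -ufarmer farming offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata offline_monitoring_RunPeriod2014_10_ver17_hd_rawdataJob offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata_aux > backup_2014_10_ver17.sql<br />
# CAUTION: executing the dump drops existing tables of the same names before recreating them<br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />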
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will likely submit the same job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the crontab file that will be installed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running (a hypothetical example entry is shown at the end of this section). <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
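For reference, a hypothetical cron_plugins entry (the schedule, project name, and maximum file number shown are illustrative, not the checked-in values):<br />
<pre><br />
# run exec.sh every 10 minutes: exec.sh <project name> <max file number><br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 4<br />
</pre><br />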
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
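Conceptually, the per-run merge is equivalent to a ROOT hadd over the per-file monitoring output (the file names here are illustrative, not the script's actual commands):<br />
<syntaxhighlight><br />
hadd -f monitoring_run003180.root hd_monitoring_003180_000.root hd_monitoring_003180_001.root<br />
</syntaxhighlight><br />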
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
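For illustration (these names are hypothetical), a line in histograms_to_monitor might be<br />
<syntaxhighlight>FCAL/FCAL_occupancy</syntaxhighlight><br />
while a line in macros_to_monitor might be<br />
<syntaxhighlight>/home/gxproj1/halld/monitoring/process/macros/BCAL_occupancy.C</syntaxhighlight><br />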
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done through the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
# Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have been previously processed.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/<br />
and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
# Copy the REST files to more permanent locations:<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /work/halld/data_monitoring/RunPeriod-YYYY-MM/REST/verVV<br />
#* cp -a /volatile/halld/offline_monitoring/RunPeriod-YYYY-MM/verVV/REST /cache/halld/RunPeriod-YYYY-MM/REST/verVV [under testing]<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch the software must be built to the latest versions.<br />
For the gxprojN user accounts used, all software builds are contained in the directory ~/builds<br />
(which are soft links to /work/halld/home/gxproj5/builds). When logging into these accounts<br />
the setup files ~/setup_jlab-2015-03.csh or similar files should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete the contents, then download the newest version from git and build: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds,and create a new sqlite file. We create the sqlite file in a temporary directory since creating the sqlite file in a directory where the output file exists causes errors. Original documentation on creating sqlite files are [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we will track the revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
=== Starting the Launch and Submitting Jobs ===<br />
<br />
Until the summer of 2015 we relied solely on Mark Ito's jproj system for submitting and keeping track of jobs. We have since moved to the swif system and use the hdswif wrapper for this. Below are instructions for how to use these.<br />
<br />
# Downloading hdswif: Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/halld/hdswif. <pre>cd ~/halld </pre> <pre> svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif</pre> <pre>cd hdswif</pre><br />
# Creating the workflow: Within SWIF jobs are registered into workflows. First create the workflow. For offline monitoring, the workflow names are of the form <b>offline_monitoring_RunPeriod201Y_MM_verVV_hd_rawdata</b> with suitable replacements for the run period and version number. The command "swif list" will list all existing workflows. Also, or most simple SWIF commands hdswif also provides a wrapper. <pre>swif list</pre> <pre>swif create offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata</pre> Or equivalently, <pre>hdswif.py create offline_monitoring_RunPeriod2015_03_ver99_hd_rawdata </pre><br />
# Registering jobs in the workflow: To register jobs within the workflow, hdswif provides the use of config files. Jobs can be registered by specifying the workflow, config file (-c), run (-r) and file (-f) numbers if necessary. A typical config file will look this:<br />
<pre><br />
PROJECT gluex<br />
TRACK reconstruction<br />
OS centos65<br />
NCORES 4<br />
DISK 40<br />
RAM 8<br />
TIMELIMIT 16<br />
JOBNAMEBASE offmon_<br />
RUNPERIOD 2015-03<br />
VERSION 92<br />
OUTPUT_TOPDIR /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Example of other variables included in variable<br />
SCRIPTFILE /home/gxproj5/halld/hdswif/script.sh # Must specify full path<br />
ENVFILE /home/gxproj5/halld/hdswif/setup_jlab-2015-03.csh # Must specify full path<br />
</pre><br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machine does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which contains the location of input/output directories, etc...<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its the full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the ouptut date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory corresponding to the run period and revision corresponding to the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
Then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are current run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to setup the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducability and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The data at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70727Data Monitoring Procedures2015-10-14T18:14:35Z<p>Kmoriya: /* Offline Monitoring: Running Over Archived Data */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819; more study is needed to see whether the straight-line track fitter is needed: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307, 1308, 1319 - 1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== General Information on Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
In the summer of 2015 we transitioned from a system based on Mark Ito's jproj scripts<br />
to the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package, the directories must be checked out and the necessary<br />
scripts run.<br />
<br />
=== Preparing the software for the launch ===<br />
<br />
To begin a new launch, the software must be rebuilt at its latest versions.<br />
For the gxprojN user accounts, all software builds are contained in the directory ~/builds<br />
(a soft link to, e.g., /work/halld/home/gxproj5/builds). When logging into these accounts,<br />
the setup file ~/setup_jlab-2015-03.csh (or a similar file) should be sourced.<br />
<br />
<b>Note that Mark Ito does <u>not</u> want you to change the contents of each .cshrc file.</b><br />
You should consult him if you feel the need.<br />
<br />
<span style="color:red">NOTE: IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT/CHECKOUT OF A NEW VERSION OF EACH SOFTWARE TO AVOID STALE HEADER FILES.</span><br />
<br />
# Building hdds: Go to ~/builds/hdds. The directory hdds is the one from git. Delete it, then download the newest version from git and build from within the checkout: <pre>cd ~/builds/hdds/</pre> <pre>rm -frv hdds</pre> <pre>git clone https://github.com/JeffersonLab/hdds</pre> <pre>cd hdds</pre> <pre>scons install</pre><br />
# Building sim-recon: Go to ~/builds/sim-recon. The directory sim-recon is the one from git. Delete it, then download the newest version from git and build from the src directory: <pre>cd ~/builds/sim-recon</pre> <pre>rm -frv sim-recon</pre> <pre>git clone https://github.com/JeffersonLab/sim-recon</pre> <pre>cd sim-recon/src</pre> <pre>scons install -j8</pre><br />
# Prepare the latest sqlite file: The sqlite file location is set in the ~/setup_jlab-2015-03.csh script as sqlite:////home/gxproj5/builds/ccdb.sqlite through the environment variables <b>JANA_CALIB_URL</b> and <b>CCDB_CONNECTION</b>. Therefore, go to ~/builds and create a new sqlite file. We create the sqlite file in a temporary directory, since creating it in a directory where the output file already exists causes errors (a sketch of moving it into place follows this list). Original documentation on creating sqlite files is [https://halldweb.jlab.org/wiki/index.php/SQLite-form_of_the_CCDB_database here]. <pre>cd ~/builds/tmp</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done <b>BEFORE</b> launch project creation. This is because we track the revisions of the libraries used, which is done by extracting the version-control information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . This assumption is necessary to extract information about the library locations automatically.<br />
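After the dump completes, the new file must be moved into the location referenced by the environment variables above. A minimal sketch, assuming the paths given in this section (the sanity check is purely illustrative):<br />
<pre><br />
mv ~/builds/tmp/ccdb.sqlite ~/builds/ccdb.sqlite<br />
# quick sanity check that the file is a readable sqlite database<br />
sqlite3 ~/builds/ccdb.sqlite ".tables" | head<br />
</pre><br />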
<br />
Create the appropriate project(s) and submit the jobs using hdswif, as detailed in the section below.<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
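As a concrete illustration, the three tables for a given launch can be listed with a wildcard query (the project name here is hypothetical):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "show tables like 'offline_monitoring_RunPeriod2015_03_ver15%'"<br />
</pre><br />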
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts: <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd jproj/projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and the version number VV. One thing to do BEFORE creating a new project is to edit the conditions of the launch (plugins to run, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre> (a concrete example follows below).<br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
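For example, a launch over the 2015-03 run period with launch version 15 (hypothetical values) would be created with:<br />
<pre><br />
./create_project.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata<br />
</pre><br />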
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory it was created in, the project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the form given in <project name>.jproj . If you want to register only a subset of such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
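For instance, to submit at most 5 jobs for a single run as an initial test (project name and run number are hypothetical):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />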
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts work and that the plugins do not crash. Once you are<br />
sure of this, you can submit all remaining jobs. What remains afterwards is the post-processing of the monitoring results,<br />
which will (among other things) put the results on the webpages for the collaboration to view, and<br />
the analysis of the launch statistics.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if test jobs were run, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query Auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Commonly used commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the mysqldump command to write out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop each table if it already exists, caution is advised.<br />
Example usage, backing up all three tables created for run period 2014_10, ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
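The dump is plain SQL, so restoring it amounts to feeding it back to mysql (the dump file name here is an assumption about what backup_tables.sh writes):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />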
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project; otherwise you will likely submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
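For orientation, a cron_plugins entry might look like the following sketch (the 10-minute schedule, project name, and maximum file number are illustrative assumptions):<br />
<pre><br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 4<br />
</pre><br />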
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a fresh copy of the scripts, e.g., for a new monitoring run, check them out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
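To verify that the environment provides the database bindings, a quick import check can be used (the module name MySQLdb is an assumption; substitute whichever module the scripts import):<br />
<pre><br />
python -c "import MySQLdb; print 'ok'"<br />
</pre><br />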
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. Connecting to the monitoring database on the JLab CUE requires modules included in the local installation of python >= 2.7. The shell script is appropriate for use in a cron job; the cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, which specify the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
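The per-run combination is conceptually the standard ROOT merge step; a minimal manual equivalent would be (file names are hypothetical):<br />
<pre><br />
# merge all per-file monitoring histograms for one run into a single file<br />
hadd -f hd_monitoring_Run003180.root hd_monitoring_Run003180_*.root<br />
</pre><br />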
<br />
Plots for the monitoring web page can be made from single histograms, or from multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files (illustrative contents are sketched after this list):<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
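For example, histograms_to_monitor might contain lines like (all names are hypothetical):<br />
<pre><br />
FCAL_occupancy<br />
/CDC/cdc_raw_t<br />
</pre><br />
and macros_to_monitor lines like (the path is hypothetical):<br />
<pre><br />
/home/gxproj1/halld/monitoring/process/macros/FCAL_occupancy.C<br />
</pre><br />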
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should be the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it when storing the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have previously been processed.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below, then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriya
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run Periods 201Y-MM is for example 2015-03, launch ver verVV is for example ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs,and allows everybody to see improvements in each detector.<br />
For each launch, independent builds of hdds, sim-recon, the monitoring plugins, and an sqlite file will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere. Most output directories for offline monitoring are created<br />
with group read/write permissions so that any Hall D group user has access to the contents,<br />
but there are some cases where use of the account that created the launch is necessary.<br />
<br />
The accounts used for offline monitoring are the gxprojN accounts created and maintained by<br />
Mark Ito (see [https://halldweb.jlab.org/wiki/index.php/GlueX-related_shared_accounts_on_the_JLab_CUE here]).<br />
As of October 2015, we have been using<br />
* gxproj1 for running over Fall 2014 data (deprecated since June 2015)<br />
* gxproj5 for running over Spring 2015 data<br />
<br />
Since the summer of 2015 we have transitioned from a system using Mark Ito's jproj scripts<br />
to integrating the swif system that Chris Larrieu (SciComp) has been developing. For offline monitoring,<br />
the hdswif system that Kei developed is used for launching the jobs, and the jproj system is used<br />
for meta-analysis of launch statistics.<br />
<br />
Both hdswif and jproj are maintained in svn:<br />
* hdswif: https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif<br />
* jproj : https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj<br />
<br />
To run each package the directories will need to be checked out and the necessary<br />
scripts will need to be run.<br />
<br />
Below the procedures for offline monitoring are explained.<br />
<br />
# To stop the incoming-data cron job, first kill the cron job with cron -r. Also, delete all jobs that have not started yet. If these jobs are still alive, they will cause confusion over which software version they ran. Do <pre>crontab -l</pre> to list the current cron jobs, and run <pre>crontab -r</pre> to kill them.<br />
# svn update & rebuild HDDS with <pre>cd $HDDS_HOME</pre><pre> svn up</pre><pre> scons install</pre><br />
# svn update & rebuild sim-recon with <pre>cd $HALLD_HOME</pre> <pre> svn up</pre> <pre>cd src</pre> <pre>scons -c install</pre> <pre>scons install</pre> The "scons -c install" is necessary to clean out any old header files. NOTE THAT IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT AND CHECK OUT A NEW VERSION SO THAT STALE HEADER FILES ARE NOT PRESENT.<br />
# svn update & rebuild monitoring plugins with <pre>cd /home/gxproj1/builds/online/packages/SBMS</pre> <pre> svn up</pre> <pre>cd /home/gxproj1/builds/online/packages/monitoring/src/plugins</pre> <pre> svn up</pre> <pre>scons -u install</pre><br />
# Prepare the latest sqlite file with: <pre>cd $HOME/builds/</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done BEFORE launch project creation. This is because we will track the svn revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that we have the topmost build directory (usually called GLUEX_TOP) to be $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
# Create the appropriate project(s) and submit the jobs using the Hall D Job Management System, as detailed in the section below.<br />
# Restart cron jobs for immediate processing of runs coming in.<br />
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but can also be used for other sets of job launches as well. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to editthe conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 2YYY_MM instead of 2YYY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate it. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only on a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. The remaining jobs are then the monitoring<br />
which will (among other things) put the results on the online webpage for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing it's contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table when it exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably multiply-submit a job. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machine does not have the modules to allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, and other configuration files into a directory accessible outside the counting house. This python script automatically checks for new ROOT files, which it will then automatically process. It contains several configuration variables that must be correctly set, which contains the location of input/output directories, etc...<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its the full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below:<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the ouptut date used by the job submission script<br />
## OUTPUTDIR should correspond to the directory corresponding to the run period and revision corresponding to the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.<br />
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b>, and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
Then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are current run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to setup the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
<br />
Check log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, check log files, and modify check_monitoring_data.csh to vary the verbosity of the output.<br />
<br />
==Data Versions==<br />
<br />
To document the conditions of the monitoring data that is created, for the sake of reproducability and further analysis we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70723Data Monitoring Procedures2015-10-14T17:47:45Z<p>Kmoriya: /* Offline Monitoring: Running Over Archived Data */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods are labeled 201Y-MM (for example, 2015-03); launch versions are labeled verVV (for example, ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results with the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, fresh builds of hdds, sim-recon, and the monitoring plugins, as well as an sqlite file, will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section [[#Post-Processing_Procedures | Post-Processing_Procedures]] below.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere.<br />
<br />
To do this, you will check out a directory from svn that contains all the necessary scripts,<br />
and running a generation script will generate all the necessary files for the present launch.<br />
The main engine behind keeping track of all the files submitted and their status is the jproj<br />
system created by Mark Ito.<br />
<br />
Below is the process of creating and running a launch.<br />
<br />
<br />
# To stop the incoming-data cron job, first kill the cron job with crontab -r. Also, delete all jobs that have not started yet; if these jobs are still alive, they will cause confusion over which software version they ran. Do <pre>crontab -l</pre> to list the current cron jobs, and run <pre>crontab -r</pre> to kill them.<br />
# svn update & rebuild HDDS with <pre>cd $HDDS_HOME</pre><pre> svn up</pre><pre> scons install</pre><br />
# svn update & rebuild sim-recon with <pre>cd $HALLD_HOME</pre> <pre> svn up</pre> <pre>cd src</pre> <pre>scons -c install</pre> <pre>scons install</pre> The "scons -c install" is necessary to clean out any old header files. NOTE THAT IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT AND CHECK OUT A NEW VERSION SO THAT STALE HEADER FILES ARE NOT PRESENT.<br />
# svn update & rebuild monitoring plugins with <pre>cd /home/gxproj1/builds/online/packages/SBMS</pre> <pre> svn up</pre> <pre>cd /home/gxproj1/builds/online/packages/monitoring/src/plugins</pre> <pre> svn up</pre> <pre>scons -u install</pre><br />
# Prepare the latest sqlite file with: <pre>cd $HOME/builds/</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done BEFORE launch project creation. This is because we track the svn revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically. A consolidated sketch of the build steps appears after this list.<br />
# Create the appropriate project(s) and submit the jobs using the Hall D Job Management System, as detailed in the section below.<br />
# Restart cron jobs for immediate processing of runs coming in.<br />
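Gathered into one sequence for convenience, the build steps above look as follows (a sketch assuming the standard gxproj environment variables; all paths and commands are the ones quoted in the list):<br />
<pre><br />
# Rebuild HDDS<br />
cd $HDDS_HOME && svn up && scons install<br />
# Rebuild sim-recon, cleaning first so that no stale headers survive<br />
cd $HALLD_HOME && svn up && cd src && scons -c install && scons install<br />
# Rebuild the monitoring plugins<br />
cd /home/gxproj1/builds/online/packages/SBMS && svn up<br />
cd /home/gxproj1/builds/online/packages/monitoring/src/plugins && svn up && scons -u install<br />
# Snapshot the CCDB into an sqlite file<br />
cd $HOME/builds && $CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
</pre><br />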
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. What remains after that is the post-processing,<br />
which will (among other things) put the results on the monitoring webpages for the collaboration to view, and<br />
the analysis of the launch. An example test submission is shown below.<br />
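For instance, to submit three test jobs for a single run before sending in everything (the run number here is purely illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 3 002931<br />
</pre><br />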
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
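To pick out only problematic jobs, a query along the following lines can be used (a sketch; the exact strings stored in the status and result columns are whatever Auger reports):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select run,file,jobId,status,result,error from <project_name>Job where result is not null and result != 'SUCCESS'"<br />
</pre><br />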
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
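Restoring is then a matter of feeding the dump back to mysql (a sketch; substitute the actual file written by backup_tables.sh):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < [backup file].sql<br />
</pre><br />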
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
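A representative crontab entry (illustrative only: the project name and maximum file number are hypothetical, and the 10-minute cadence follows the convention used by earlier monitoring cron jobs):<br />
<pre><br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />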
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
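For example, assuming one entry per line (all names below are hypothetical), histograms_to_monitor might contain<br />
<syntaxhighlight><br />
FCAL/fcal_occupancy<br />
CDC/cdc_num_events<br />
</syntaxhighlight><br />
and macros_to_monitor might contain<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/macros/BCAL_occupancy.C<br />
</syntaxhighlight><br />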
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just created. Presumably, this directory will be empty at the beginning.<br />
## Once you have created a new data version as defined below, pass the needed information as a command line option. Currently this is done via the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, the data is processed using the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they have already been processed.<br />
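For example, to force a full reprocessing of the launch from the example above (a sketch; it assumes the --force flag can simply be prepended to the positional arguments):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />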
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below, then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py; future versions of the script will streamline this part of the procedure. An example of how to register a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
create_project.sh [project name] hd_rawdata<br />
then go to the directory [project name]/processing/<br />
and execute<br />
./run_processing.sh<br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to adjust the verbosity of the output.<br />
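For a quick scan of these logs for problems (a sketch; the exact log file names are whatever the cron job writes into that directory):<br />
<syntaxhighlight><br />
grep -il error $HOME/halld/monitoring/process/log/*<br />
</syntaxhighlight><br />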
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70722Data Monitoring Procedures2015-10-14T17:44:23Z<p>Kmoriya: /* Offline Monitoring: Running Over Archived Data */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods are labeled 201Y-MM (for example, 2015-03); launch versions are labeled verVV (for example, ver15)<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once files are written to tape, we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results with the latest calibration and software. <br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector.<br />
For each launch, fresh builds of hdds, sim-recon, and the monitoring plugins, as well as an sqlite file, will be generated.<br />
<br />
Below the procedures are described for<br />
# Preparing the software for the launch<br />
# Starting the launch (using hdswif)<br />
# Post-analysis of statistics of the launch<br />
<br />
Processing the results and making them available to the collaboration<br />
is handled in the section below<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified in at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere.<br />
<br />
To do this, you will check out a directory from svn that contains all the necessary scripts,<br />
and running a generation script will generate all the necessary files for the present launch.<br />
The main engine behind keeping track of all the files submitted and their status is the jproj<br />
system created by Mark Ito.<br />
<br />
Below is the process of creating and running a launch.<br />
<br />
<br />
# To stop the incoming-data cron job, first kill the cron job with crontab -r. Also, delete all jobs that have not started yet; if these jobs are still alive, they will cause confusion over which software version they ran. Do <pre>crontab -l</pre> to list the current cron jobs, and run <pre>crontab -r</pre> to kill them.<br />
# svn update & rebuild HDDS with <pre>cd $HDDS_HOME</pre><pre> svn up</pre><pre> scons install</pre><br />
# svn update & rebuild sim-recon with <pre>cd $HALLD_HOME</pre> <pre> svn up</pre> <pre>cd src</pre> <pre>scons -c install</pre> <pre>scons install</pre> The "scons -c install" is necessary to clean out any old header files. NOTE THAT IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT AND CHECK OUT A NEW VERSION SO THAT STALE HEADER FILES ARE NOT PRESENT.<br />
# svn update & rebuild monitoring plugins with <pre>cd /home/gxproj1/builds/online/packages/SBMS</pre> <pre> svn up</pre> <pre>cd /home/gxproj1/builds/online/packages/monitoring/src/plugins</pre> <pre> svn up</pre> <pre>scons -u install</pre><br />
# Prepare the latest sqlite file with: <pre>cd $HOME/builds/</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre><br />
# Note that the above steps must be done BEFORE launch project creation. This is because we track the svn revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically. A consolidated sketch of the build steps appears after this list.<br />
# Create the appropriate project(s) and submit the jobs using the Hall D Job Management System, as detailed in the section below.<br />
# Restart cron jobs for immediate processing of runs coming in.<br />
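Gathered into one sequence for convenience, the build steps above look as follows (a sketch assuming the standard gxproj environment variables; all paths and commands are the ones quoted in the list):<br />
<pre><br />
# Rebuild HDDS<br />
cd $HDDS_HOME && svn up && scons install<br />
# Rebuild sim-recon, cleaning first so that no stale headers survive<br />
cd $HALLD_HOME && svn up && cd src && scons -c install && scons install<br />
# Rebuild the monitoring plugins<br />
cd /home/gxproj1/builds/online/packages/SBMS && svn up<br />
cd /home/gxproj1/builds/online/packages/monitoring/src/plugins && svn up && scons -u install<br />
# Snapshot the CCDB into an sqlite file<br />
cd $HOME/builds && $CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite<br />
</pre><br />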
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job, and is done within the analysis directory of each launch.<br />
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked in, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that match the pattern in <project name>.jproj . If you want to register only a subset of these files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure that this does not happen, you can send all jobs in. What remains after that is the post-processing,<br />
which will (among other things) put the results on the monitoring webpages for the collaboration to view, and<br />
the analysis of the launch. An example test submission is shown below.<br />
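For instance, to submit three test jobs for a single run before sending in everything (the run number here is purely illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 3 002931<br />
</pre><br />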
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
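To pick out only problematic jobs, a query along the following lines can be used (a sketch; the exact strings stored in the status and result columns are whatever Auger reports):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select run,file,jobId,status,result,error from <project_name>Job where result is not null and result != 'SUCCESS'"<br />
</pre><br />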
<br />
=== Handy mysql Instructions ===<br />
<br />
* Handy mysql instructions:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
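Restoring is then a matter of feeding the dump back to mysql (a sketch; substitute the actual file written by backup_tables.sh):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < [backup file].sql<br />
</pre><br />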
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit duplicate jobs. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
<br />
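A representative crontab entry (illustrative only: the project name and maximum file number are hypothetical, and the 10-minute cadence follows the convention used by earlier monitoring cron jobs):<br />
<pre><br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />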
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this to run over runs, or set the parameters<br />
MINRUN and MAXRUN which will set the range of runs submitted.<br />
--><br />
<br />
==Post-Processing Procedures==<br />
<br />
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.<br />
<br />
The scripts used to generate this summary data are primarily run from /home/gxprojN/halld/monitoring/process . If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:<br />
<syntaxhighlight><br />
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process<br />
</syntaxhighlight><br />
Note that these scripts currently have some parameters which must be periodically set by hand.<br />
<br />
The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:<br />
<syntaxhighlight><br />
source /home/gxproj1/halld/monitoring/process/monitoring_env.csh<br />
</syntaxhighlight><br />
<br />
===Online Monitoring===<br />
<br />
There are two primary scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:<br />
<syntaxhighlight><br />
/home/gluex/halld/monitoring/process/check_new_runs.py<br />
<br />
OR <br />
<br />
/home/gluex/halld/monitoring/process/check_new_runs.csh<br />
</syntaxhighlight><br />
The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules included in the local installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job. The cronjob is currently run under the "gluex" account.<br />
<br />
The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be correctly set, specifying the locations of the input/output directories, etc.<br />
<br />
===Offline Monitoring===<br />
<br />
After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:<br />
<syntaxhighlight><br />
/home/gxproj1/halld/monitoring/process/check_monitoring_data.csh <br />
</syntaxhighlight><br />
This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis. <br />
<br />
Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:<br />
* histograms_to_monitor - specify either the name of the histogram or its full ROOT path <br />
* macros_to_monitor - specify the full path to the RootSpy macro .C file<br />
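<br />
For example, entries in these files might look like the following; the specific histogram names, ROOT paths, and macro file are hypothetical and only illustrate the expected formats:<br />
<syntaxhighlight><br />
# histograms_to_monitor: one entry per line, either a bare name or a full ROOT path (entries hypothetical)<br />
NumEvents<br />
/occupancy/cdc_occupancy<br />
<br />
# macros_to_monitor: full path to the RootSpy macro .C file (entry hypothetical)<br />
/home/gxproj1/halld/monitoring/macros/BCAL_occupancy.C<br />
</syntaxhighlight><br />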
<br />
When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:<br />
# Add a new data version, as described below.<br />
# Change the following parameters in check_monitoring_data.csh:<br />
## JOBDATE should correspond to the output date used by the job submission script<br />
## OUTPUTDIR should point to the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.<br />
## Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.<br />
<br />
<syntaxhighlight><br />
Example configuration parameters:<br />
set JOBDATE=2015-01-09<br />
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring<br />
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08<br />
set ARGS=" -v RunPeriod-2014-10,8 "<br />
</syntaxhighlight><br />
If you want to process the results manually, use the following script:<br />
<syntaxhighlight><br />
./process_new_offline_data.py <job date> <input directory> <output directory><br />
<br />
EXAMPLE:<br />
<br />
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />
The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've already been processed.<br />
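<br />
For example, to force a full reprocessing of the launch shown above (the placement of the option before the positional arguments is an assumption):<br />
<syntaxhighlight><br />
./process_new_offline_data.py --force 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02<br />
</syntaxhighlight><br />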
<br />
Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which should then be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:<br />
<syntaxhighlight><br />
./register_new_version.py add /home/gxproj1/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt<br />
</syntaxhighlight><br />
<br />
<b>If you are running the offline monitoring by checking out the files in trunk/scripts/monitoring/jproj/projects/<br />
of the svn repository</b> and created a project with<br />
<pre>create_project.sh [project name] hd_rawdata</pre><br />
then go to the directory [project name]/processing/ and execute<br />
<pre>./run_processing.sh</pre><br />
which will run register_new_version.py as well as check_monitoring_data.csh for that project.<br />
<br />
===Step-by-Step Instructions For Processing a New Monitoring Run===<br />
<br />
The monitoring runs are currently run out of the gxproj1 and gxproj5 accounts. After an offline monitoring run has been successfully started on the batch farm, the following steps should be followed to set up the post-processing for these runs.<br />
<br />
# The post-processing scripts are stored in $HOME/halld/monitoring/process and are automatically run by cron.<br />
# Run "svn update" to bring any changes in. Be sure that the list of histograms and macros to plot are current.<br />
# Edit check_monitoring_data.csh to point to the current revisions/directories<br />
#* VERSION<br />
#* ARGS<br />
#* Note that the environment depends on a standard script - $HOME/setup_jlab.csh<br />
# Update files in the web directory, so that the results are displayed on the web pages: /group/halld/www/halldweb/html/data_monitoring/textdata<br />
<br />
Check the log files in $HOME/halld/monitoring/process/log for more information on how each run went. If there are problems, modify check_monitoring_data.csh to vary the verbosity of the output.<br />
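<br />
For a quick scan for problems, something like the following can be used (assuming the logs are plain text files in that directory):<br />
<syntaxhighlight><br />
grep -il error $HOME/halld/monitoring/process/log/*<br />
</syntaxhighlight><br />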
<br />
==Data Versions==<br />
<br />
To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but versions of raw and reconstructed data, so that this database table can be used for the event database as well.<br />
<br />
We store one record per pass through one run period, with the following structure:<br />
<br />
{| class="wikitable"<br />
! Field !! Description<br />
|-<br />
| data_type || The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring<br />
|- <br />
| run_period || The run period of the data<br />
|- <br />
| revision || An integer specifying which pass through the run period this data corresponds to<br />
|- <br />
| software_version || The name of the XML file that specifies the different software versions used<br />
|-<br />
| jana_config || The name of the text file that specifies which JANA options were passed to the reconstruction program<br />
|-<br />
| ccdb_context || The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used<br />
|-<br />
| production_time || The date at which monitoring/reconstruction began<br />
|-<br />
| dataVersionString || A convenient string for identifying this version of the data<br />
|}<br />
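<br />
Once registered, versions can be inspected directly from the monitoring database using the access command listed in the Master List section above; the table name version_info below is an assumption for illustration:<br />
<syntaxhighlight><br />
# list registered data versions (the table name version_info is hypothetical)<br />
mysql -u datmon -h hallddb.jlab.org data_monitoring -e "select data_type,run_period,revision,dataVersionString from version_info"<br />
</syntaxhighlight><br />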
<br />
<br />
An example file used as input to ./register_new_version.py is:<br />
<syntaxhighlight><br />
data_type = recon<br />
run_period = RunPeriod-2014-10<br />
revision = 1<br />
software_version = soft_comm_2014_11_06.xml<br />
jana_config = jana_rawdata_comm_2014_11_06.conf<br />
ccdb_context = calibtime=2014-11-10<br />
production_time = 2014-11-10<br />
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01<br />
</syntaxhighlight></div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=Data_Monitoring_Procedures&diff=70721Data Monitoring Procedures2015-10-14T17:37:08Z<p>Kmoriya: /* Master List of File / Database / Webpage Locations */</p>
<hr />
<div>__TOC__<br />
<br />
== Master List of File / Database / Webpage Locations ==<br />
=== Run Conditions ===<br />
* Online Run-by-run condition files (B-field, current, etc.): /work/halld/online_monitoring/conditions/<br />
* Offline monitoring run conditions (software versions, jana config): /group/halld/data_monitoring/run_conditions/<br />
*[http://www.jlab.org/Hall-D/test/RunInfo/ Run Info vers. 1]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl Run Info vers. 2]<br />
<br />
=== Monitoring Output Files ===<br />
* Run periods 201Y-MM are, for example, 2015-03; launch versions verVV are, for example, ver15<br />
* Online monitoring histograms: /work/halld/online_monitoring/root/<br />
* Offline monitoring histogram ROOT files: /work/halld/data_monitoring/RunPeriod-201Y-MM/verVV/rootfiles<br />
* REST files (most recent launch only): /work/halld/data_monitoring/RunPeriod-201Y-MM/REST/verVV<br />
* individual files for each job (ROOT, log, etc.): /volatile/halld/offline_monitoring/RunPeriod-201Y-MM/verVV/<br />
<br />
=== Monitoring Database ===<br />
* Accessing monitoring database (on ifarm): mysql -u datmon -h hallddb.jlab.org data_monitoring<br />
<br />
=== Monitoring Webpages ===<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/plotBrowser.py Plot Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/runBrowser.py Run Browser]<br />
*[https://halldweb.jlab.org/cgi-bin/data_monitoring/monitoring/timeSeries.py Time Series]<br />
*[https://halldweb.jlab.org/data_monitoring/launch_analysis/ Launch Analysis]<br />
<br />
== Job Monitoring Links ==<br />
* [http://scicomp.jlab.org/farm2/job.html Completed Job History]<br />
* [http://scicomp.jlab.org/farm2/project.html Job Stats By Project]<br />
* [http://scicomp.jlab.org/farm2/trackOrg.html Job Stats By Track]<br />
* [http://scicomp.jlab.org/farm2/report.html Cluster Report]<br />
* [http://scicomp.jlab.org/farm2/walltime.html Walltime Distribution]<br />
<br />
== Saving Online Monitoring Data ==<br />
<br />
The procedure for writing the data out is given in, e.g.,<br />
[https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy Raid-to-Silo Transfer Strategy].<br />
<br />
Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape,<br />
and within ~20 min., we will have access to the file on tape at<br />
/mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.<br />
<br />
All online monitoring plugins will be run as data is taken.<br />
They will be accessible within the counting house via RootSpy, and<br />
for each run and file, a ROOT file containing the histograms will be saved<br />
within a subdirectory for each run.<br />
<br />
For immediate access to these files, the raid disk files may be accessed directly<br />
from the counting house, or the tape files will be available within ~20 min. of the<br />
file being written out.<br />
<br />
== Offline Monitoring: Running Over Archived Data ==<br />
<br />
Once the files are written to tape we can run the online plugins on these files to confirm what we were seeing in the online monitoring.<br />
Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.<br />
<br />
Every Friday, jobs will be started to run the newest software on all previous runs,<br />
allowing everybody to see improvements in each detector over the week.<br />
When launching, independent builds of hdds, sim-recon, and the monitoring plugins, as well as an sqlite file, will<br />
be generated.<br />
<br />
<!--<br />
==== Generating an offline plugin job ====<br />
<br />
The user gxproj1 should be used for official offline monitoring jobs.<br />
Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files.<br />
The main script is generatejobs_plugins_rawdata.sh, which can be used as<br />
generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)<br />
where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify<br />
the file range for that run.<br />
<br />
This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits.<br />
Executing this script will send the monitoring plugins job to the Auger batch system.<br />
<br />
There is also a script clean.sh which can be used as<br />
clean.sh XXX<br />
This will clean up all associated files created in association with run XXX.<br />
<br />
Internally, the xml file used to submit the job will be created, and the job to run<br />
will be given within script.sh. All run parameters should be specified at the beginning<br />
of generatejobs_plugins_rawdata.sh<br />
Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run<br />
over this cached file.<br />
<br />
==== Using cron to run automatically ====<br />
Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins<br />
that can be executed via<br />
crontab cron_plugins<br />
This will set up a cron job to call the script scan_for_jobs.sh, which will<br />
check in the rawdata directory and call generatejobs_plugins_rawdata.sh for<br />
any run that is more than 5 min old. The cron job is set up to run every 10 min.<br />
<br />
==== Magnetic Field Settings ====<br />
For tracking, it is necessary to set the correct magnetic field settings.<br />
The below table is now obsolete. Please refer to the [https://halldweb.jlab.org/cgi-bin/data_monitoring/run_conditions.pl GlueX Run Conditions] webpage.<br />
<br />
The actual field settings for each run have not been documented well.<br />
Below are values based on going through entries in the [https://logbooks.jlab.org/book/halld halld log].<br />
<br />
{| border="1" cellpadding="1" style="text-align: center;"<br />
!width="100"| Run #s<br />
!width="150"| Solenoid Current (A)<br />
!width="300"| JANA option<br />
!width="300"| notes<br />
|-<br />
| 940 - 996 || 1000 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104 || At run 997, the solenoid started ramping down<br />
|-<br />
| 998 - 1448 || 0 || -PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine || Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See [https://logbooks.jlab.org/entry/3309039 this] and [https://logbooks.jlab.org/entry/3309092 this] entry.<br />
|-<br />
| 1449 - 1620 || 1200 || -PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520 || <br />
|-<br />
|}<br />
<br />
'''Note:''' The following run ranges were taken with a 300A field, and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819 , more study is needed to see if the straight line track fitter is needed or not: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 [1307,1308,1319-1329 were taken while the magnet was ramping].<br />
--><br />
<br />
=== Procedures ===<br />
This section explains how the offline monitoring should be run.<br />
Since we may want to simultaneously run offline monitoring for different run periods that require<br />
different environment variables, the scripts are set up so that a generic user can download the<br />
scripts and run them from anywhere.<br />
<br />
To do this, you will check out a directory from svn that contains all the necessary scripts,<br />
and running a generation script will generate all the necessary files for the present launch.<br />
The main engine behind keeping track of all the files submitted and their status is the jproj<br />
system created by Mark Ito.<br />
<br />
Below is the process of creating and running a launch.<br />
<br />
<br />
# To stop the incoming-data cron job, first kill the cron job with crontab -r. Also, delete all jobs that have not started yet. If these jobs are still alive, they will cause confusion over which software version they ran. Do <pre>crontab -l</pre> to list the current cron jobs, and run <pre>crontab -r</pre> to kill them.<br />
# svn update & rebuild HDDS with <pre>cd $HDDS_HOME</pre><pre> svn up</pre><pre> scons install</pre><br />
# svn update & rebuild sim-recon with <pre>cd $HALLD_HOME</pre> <pre> svn up</pre> <pre>cd src</pre> <pre>scons -c install</pre> <pre>scons install</pre> The "scons -c install" is necessary to clean out any old header files. NOTE THAT IT IS A GOOD IDEA TO DO A COMPLETE WIPEOUT AND CHECK OUT A NEW VERSION SO THAT STALE HEADER FILES ARE NOT PRESENT.<br />
# svn update & rebuild monitoring plugins with <pre>cd /home/gxproj1/builds/online/packages/SBMS</pre> <pre> svn up</pre> <pre>cd /home/gxproj1/builds/online/packages/monitoring/src/plugins</pre> <pre> svn up</pre> <pre>scons -u install</pre><br />
# Prepare the latest sqlite file with: <pre>cd $HOME/builds/</pre> <pre>$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite</pre> A quick sanity check of the resulting file is shown below, after this list.<br />
# Note that the above steps must be done BEFORE launch project creation. This is because we will track the svn revisions of the libraries used, and this is done by extracting the svn information in each directory. Also, note that the system assumes that the topmost build directory (usually called GLUEX_TOP) is $HOME/builds . Such an assumption is necessary to be able to extract information about the library locations automatically.<br />
# Create the appropriate project(s) and submit the jobs using the Hall D Job Management System, as detailed in the section below.<br />
# Restart cron jobs for immediate processing of runs coming in.<br />
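<br />
As an optional sanity check of the ccdb.sqlite file prepared above (not part of the official procedure), the sqlite3 .tables meta-command lists the tables contained in the file:<br />
<pre><br />
sqlite3 $HOME/builds/ccdb.sqlite ".tables"<br />
</pre><br />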
<br />
== Hall D Job Management System ==<br />
<br />
This section details instructions on how to create and launch a set of jobs using the Hall-D Job Management System developed by Mark Ito. These instructions are generic: this system can be used for the weekly monitoring jobs, but also for other sets of job launches. <br />
<br />
=== Database Table Overview ===<br />
<br />
* Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields. <br />
<br />
* Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, time taken to complete various stages (e.g. pending, dependency, active), and others. <br />
<br />
* Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job; this is done within the analysis directory of each launch. <br />
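<br />
For example, the job metrics table can be inspected with the same farming-database access used later in this section; the query is a minimal sketch (select * avoids assuming specific column names):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>_aux"<br />
</pre><br />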
<br />
=== Initialize Project Management ===<br />
<br />
* Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.<br />
<pre><br />
ssh gxproj1@ifarm -Y<br />
</pre><br />
<br />
* Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/<br />
<br />
* Check out the necessary scripts <pre>svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj</pre> This will get all necessary scripts for launching. Once checked out, <pre>cd projects</pre><br />
<br />
* The script create_project.sh can be used to create a new project. It will take a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This<br />
information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .<br />
To create a project do <pre>./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata</pre><br />
The name has been chosen to be as consistent as possible with other directory structures.<br />
However, mysql requires that "-" be escaped in table names, so unfortunately run periods will be<br />
given as 20YY_MM instead of 20YY-MM.<br />
<br />
* A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) and files will be copied and modified from the template directory to reflect the run period, the user, the directory that it was created in, project name, etc.<br />
<br />
* For each project, cd into the new directory<br />
* The script clear.sh will remove any existing tables for the current project name, then recreate them. Do<br />
<pre>./clear.sh</pre><br />
* To use the jproj.pl script that was checked out, add the directory to your path with<br />
<pre>source ../../scripts/setup.csh</pre><br />
or always specify the full path<br />
<pre>../../scripts/jproj.pl</pre><br />
* Now update the table of runs with<br />
<pre>jproj.pl <project name> update</pre><br />
This will fill the table with all files within /mss that are of the same form as what is in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.<br />
* Once you have registered all of the files you would like to run over, do<br />
<pre>jproj.pl <project name> submit [max # of jobs] [run number]</pre><br />
where the additional options specify how many jobs to submit and which run number to run on. Without these options all files that are registered and have not been submitted yet will be submitted.<br />
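<br />
For example, to submit at most 5 jobs for run 3180 of the project created above (the concrete project name and run number are illustrative):<br />
<pre><br />
jproj.pl offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata submit 5 3180<br />
</pre><br />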
<br />
At this stage you are ready to submit all files. It is a good idea to submit a few test jobs<br />
at first to check that all scripts are working and that the plugins do not crash. Once you are<br />
sure of this, you can send all jobs in. The remaining steps are then the post-processing of the monitoring data,<br />
which will (among other things) put the results on the monitoring webpages for the collaboration to view, and<br />
the analysis of the launch.<br />
<br />
=== Project File Overview ===<br />
<br />
An overview of each project file:<br />
* '''clear.sh:''' For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. <br />
* '''<project_name>.jproj:''' Contains the path and file name format for the input files for the jobs. <br />
* '''<project_name>.jsub:''' The xml job submission script. The run number and file number variables are set during job submission for each input file. <br />
* '''script.sh:''' The script that is executed during the job. If output job directories are not pre-created manually, they should be created in this script with the proper permissions:<br />
<pre><br />
mkdir -p -m 775 my_directory<br />
</pre><br />
* '''setup_jlab-[run period].csh:''' The environment that is sourced at the beginning of the job execution. <br />
* '''status.sh:''' Updates the job status database table, and prints some of its columns to screen.<br />
<br />
=== Project Management ===<br />
<br />
* Delete (if any) and create the database table(s) for the current set of job submissions:<br />
<pre><br />
./clear.sh<br />
</pre><br />
Also, if testing was done with jobs, it is best to delete the output directory and the configuration files:<br />
<pre><br />
rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf<br />
</pre><br />
<br />
* Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:<br />
<pre><br />
jproj.pl <project_name> update <optional_file_number><br />
</pre><br />
<br />
* Confirm that the job management database is accurate by printing its contents to screen:<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"<br />
</pre><br />
<br />
* ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run: <br />
<pre><br />
./clear.sh<br />
</pre><br />
<br />
* To look at the status of the submitted jobs, first query auger and update the job status database:<br />
<pre><br />
fill_in_job_details.pl <project_name><br />
</pre><br />
<br />
* The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):<br />
<pre><br />
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"<br />
</pre><br />
<br />
* These last two commands can instead be executed simultaneously by running:<br />
<pre><br />
./status.sh<br />
</pre><br />
<br />
=== Handy mysql Instructions ===<br />
<br />
* Some frequently used mysql commands:<br />
<pre><br />
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"<br />
quit; # Exit mysql<br />
show tables; # Show a list of the tables in the current database<br />
show columns from <project_name>; # show all of the columns for the given table<br />
select * from <project_name>; # show the contents of all rows from the given table<br />
</pre><br />
<br />
=== Backing Up Offline Monitoring Tables ===<br />
Tables created for offline monitoring can be backed up using the script backup_tables.sh which<br />
can be checked out with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects<br />
<br />
The script uses the command mysqldump to print out a file that can be executed to recreate the tables.<br />
Since executing this output file will first drop the table if it already exists, caution is advised.<br />
Example usage to backup all three tables created for run period 2014_10 ver 17:<br />
<pre>backup_tables.sh 2014_10 17</pre><br />
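<br />
To restore from such a backup, the dump can be fed back to mysql (the dump file name is hypothetical; remember that this will drop and recreate the existing tables):<br />
<pre><br />
mysql -hhallddb -ufarmer farming < backup_2014_10_ver17.sql<br />
</pre><br />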
<br />
== Running Over Data As It Comes In ==<br />
<br />
A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss.<br />
During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs<br />
run the previous Friday. The procedure for this is shown below.<br />
<br />
<!--<br />
=== Setting up the environment ===<br />
The file<br />
/home/gxproj1/setup_jlab.csh<br />
is sourced through .tcshrc.<br />
This file is the same as what is linked to by<br />
/home/gluex/setup_jlab_commissioning.csh,<br />
except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this<br />
user can have a separate build.<br />
<br />
To obtain the builds from the previous Friday's runs,<br />
execute<br />
/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]<br />
The build revisions from the previous Friday are archived in files<br />
/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml<br />
and the script will build libraries based on those stored revision numbers.<br />
--><br />
<br />
=== Running the cron job ===<br />
<br />
'''IMPORTANT:''' The cron job should not be running while you are manually submitting jobs using the jproj.pl script for the same project, or else you will probably submit the same job multiple times. <br />
<br />
* Go to the cron job directory: <br />
<pre><br />
cd /u/home/gxproj1/halld/monitoring/newruns<br />
</pre><br />
<br />
* The cron_plugins file is the cronjob that will be executed. During execution, it runs the exec.sh command in the same folder. This command takes two arguments: the project name, and the maximum file number for each run. These fields should be updated in the cron_plugins file before running. <br />
<br />
* The exec.sh command updates the job management database table with any data that has arrived on tape since it was last updated, ignoring file numbers greater than the maximum file number. It then submits jobs for these files. <br />
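<br />
For reference, a hypothetical entry in cron_plugins might look like the following; the schedule, project name, and maximum file number must be adapted to the current launch:<br />
<pre><br />
# run exec.sh every 10 minutes with the project name and maximum file number<br />
*/10 * * * * /u/home/gxproj1/halld/monitoring/newruns/exec.sh offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata 9<br />
</pre><br />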
<br />
* To start the cron job, run:<br />
<pre><br />
crontab cron_plugins<br />
</pre><br />
<br />
* To check whether the cron job is running, do<br />
<pre><br />
crontab -l<br />
</pre><br />
<br />
* To remove the cron job do<br />
<pre><br />
crontab -r<br />
</pre><br />
<br />
<!--<br />
The cron job will run the script scan_for_jobs.sh,<br />
which runs generatejobs_plugins_rawdata.sh for any<br />
new runs that it had not seen before. All previous<br />
runs are recorded in the file filelists/files_current.txt<br />
so clear this file to run over those runs again, or set the parameters<br />
MINRUN and MAXRUN, which set the range of runs submitted.<br />
--><br />
<br />
</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX-Collaboration-Oct-2015&diff=70675GlueX-Collaboration-Oct-20152015-10-09T13:22:23Z<p>Kmoriya: /* Friday October 9, 2015 */</p>
<hr />
<div>== GlueX Collaboration Meeting ==<br />
<br />
<font size="+1">October 8 to 10, 2015 at Jefferson Lab</font><br />
<br />
The following template has been set up to allow people to identify what needs to be presented at the meeting. It has been roughly broken out by working group with suggestions for topics within each group. The working group chairs should work on adding the relevant talks with an estimated time (including questions) and the speaker's name. If there is a talk that does not fit into this template, please add it at the bottom.<br />
<br />
== Registration ==<br />
Everyone participating in the collaboration meeting, whether in person at JLab or remotely via electronic media, is encouraged to register. Please visit the<br />
[https://misportal.jlab.org/Ul/conferences/generic_conference/registration.cfm?conference_id=COLLAB-GLUEX-OCT2015 Registration Page].<br />
<br />
To see who else will be attending either in person or via electronic media, please see the<br />
[https://misportal.jlab.org/Ul/conferences/generic_conference/participants.cfm?conference_id=COLLAB-GLUEX-OCT2015 Current List of Participants Page]<br />
<br />
== Location ==<br />
CEBAF Center L102<br />
<br />
== Remote Access ==<br />
# To join via a Web Browser, go to the page [https://bluejeans.com/660743227] https://bluejeans.com/660743227.<br />
# To join via Polycom room system go to the IP Address: 199.48.152.152 ([http://bjn.vc bjn.vc]) and enter the meeting ID: 660743227.<br />
# To join via phone, use one of the following numbers and the Conference ID: 660743227.<br />
#* US or Canada: +1 408 740 7256 or <br />
#* US or Canada: +1 888 240 2560<br />
# More information on connecting to [[Connect to the Data Challenge Meetings|bluejeans]] is available.<br />
<br />
== Talks on the DocDB ==<br />
<br />
The DocDB talk-upload template for the October 2015 Collaboration Meeting can be found here:<br />
<br />
http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?conferenceid=46<br />
<br />
'''Upload your talks at the link above and then edit the agenda below to enter the appropriate link to your talk based on its DocDB document number'''. <br />
<br />
Example (from the May 2008 meeting):<br />
* 15:15 Session I (90) --- Offline Working Group Meeting --- Chair Curtis Meyer<br />
** (20) [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1042 Offline Software Status] -- D. Lawrence<br />
<br />
-----<br />
<br />
== AGENDA == <br />
<br />
== Thursday October 8, 2015 ==<br />
* 8:30 Session Ia (110) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=569 Opening Session] - Chair: Matt Shepherd<br />
** 8:30 (10) --- Welcome --- Curtis Meyer<br />
** 8:40 (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2830 Hall-D Update] --- Eugene Chudakov<br />
** 9:10 (10) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2831 JLab IT] --- Amber Boehnlein<br />
** 9:20 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2829 Accelerator Update] --- Todd Satogata<br />
** 9:45 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2835 Engineering Update] --- Tim Whitlatch<br />
** 10:00 (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2826 Electronics Update] --- Fernando Barbosa<br />
* 10:30 (30) Coffee<br />
* 11:00 Session Ib (100) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=571 DAQ/Trigger/Electronics] - (Organizer:David L. ) Chair: - Elton Smith <br />
** 11:00 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2828 FA125 Firmware] --- Naomi Jarvis<br />
** 11:20 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2836 DAQ Status] --- Sergey Furletov<br />
** 11:40 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2837 L1 Trigger Status] --- Alex Somov<br />
** 12:00 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2834 Control Status] --- Hovanes Egiyan<br />
** 12:20 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2827 Online/L3 Status] --- David Lawrence<br />
* 12:40 (80) Lunch <br />
* 12:40 (80) Collaboration Board Meeting<br />
* 14:00 Session IIa (125) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=576 Beamline and Tagger I] - (Organizer: R. Jones ) - Chair: - Zisis Papandreou<br />
** 14:00 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2839 Polarimeter Update] --- Kei Moriya<br />
** 14:25 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2840 Diamond radiator fabrication] --- Brendan Pratt<br />
** 14:50 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2841 Tagger microscope calibration] --- Alex Barnes<br />
** 15:15 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2842 Tagger hodoscope calibration] --- Nathan Sparks<br />
** 15:40 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2843 Pair Spectrometer calibration] --- Alex Somov<br />
* 16:05 (20) Coffee<br />
* 16:25 Session IIb (50) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=577 Beamline and Tagger II] - (Organizer: R. Jones ) - Chair: - Zisis Papandreou<br />
** 16:25 (25) --- [http://argus.phys.uregina.ca/gluex/DocDB/0028/002833/001/beamline_radmon_spring15.pdf Photon beam monitoring survey] --- Alexandre Deur<br />
** 16:50 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2848 Fast feedback control system] --- Trent Allison<br />
* 17:15 Session IIc (50) --- Run Planning for Fall 2015 - (Organizer: ) Chair: Kei Moriya<br />
** 17:15 (50) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2832 Discussion/All]<br />
** [[Run Coordination Meetings: Fall 2015 Run]]<br />
* 18:15 --- Reception in the Atrium<br />
<br />
== Friday October 9, 2015 ==<br />
* 9:00 Session IIIa (75) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=576 Calorimeters] - (Organizer: Zisis Papandreou) - Chair: Beni Zihlmann<br />
** 9:00 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2849 BCAL Status, Simulations and Analysis] -- Zisis Papandreou<br />
** 9:25 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2850 BCAL Calibration/Reconstruction] -- Mark Dalton<br />
** 9:50 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2851 FCAL Update] -- Adesh Subedi<br />
* 10:15 (30) Coffee<br />
* 10:45 Session IIIb (100) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=581 Tracking] - (Organizer: Naomi Jarvis) - Chair: Justin Stevens<br />
** 10:45 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2852 CDC Update] -- Mike Staib<br />
** 11:10 (15) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2825 Tracking Update] -- Simon Taylor<br />
** 11:25 (10) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2838 Tracking Efficiency and Resolution] -- Paul Mattione<br />
** 11:35 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2853 FDC Update] -- Lubomir Pentchev<br />
** 12:00 (25) --- Transition Radiation Detector -- Sergey Furletov<br />
* 12:25 Lunch (95)<br />
* 14:00 Session IVa (150) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=584 Offline/Analysis] (Organizer: Mark Ito ) Chair: Dave Mack<br />
** 14:00 (25) --- [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=2845 Overview] -- Mark Ito<br />
** 14:25 (25) --- [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=2844 Software Version Management System] -- Mark Ito<br />
** 14:50 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2847 Software/Analysis Tutorial: &gamma;p &rarr; p&omega;] -- Paul Mattione<br />
** 15:15 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2854 Offline Monitoring and SWIF] -- Kei Moriya<br />
** 15:40 (25) --- Geant4 Development Update -- Richard Jones<br />
** 16:05 (25) --- Calibration Status -- Sean Dobbs<br />
* 16:30 Coffee (30)<br />
* 17:00 Session IVb (90) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=586 Particle ID] - (Organizer: Justin Stevens ) - Chair: Sean Dobbs<br />
** 17:00 (15) --- Start Counter Efficiencies -- Mahmoud Kamel <br />
** 17:15 (15) --- Start Counter Calibration & Performance -- Eric Pooser <br />
** 17:30 (30) --- [http://argus.phys.uregina.ca/cgi-bin/public/DocDB/ShowDocument?docid=2846 TOF Performance] -- Brad Cannon<br />
** 18:00 (30) --- DIRC Update -- John Hardin<br />
<br />
== Saturday October 10, 2015 ==<br />
* 9:00 Session Va (120) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=588 Physics I] - (Organizer: Volker Crede) - Chair: David Lawrence<br />
** 9:00 (30) --- Update on GlueX analysis efforts -- Justin Stevens <br />
** 9:30 (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2820 Survey of multi-photon final states from the Spring data] -- Simon Taylor<br />
** 10:00 (30) --- &omega; Photoproduction off Nuclei: Updates and plans for a PAC proposal -- Alexander Somov<br />
** 10:30 (30) --- Recent results on meson spectroscopy from CLAS -- Paul Eugenio<br />
* 11:00 (20) Coffee<br />
* 11:20 Session Vb (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=588 Physics II] - (Organizer: Volker Crede) - Chair: David Lawrence<br />
** 11:20 (15) --- Omega Decays -- Michael Staib<br />
** 11:35 (15) --- Eta Decays -- Will McGinley <br />
* 11:50 (30) Session Vc (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=592 Business Meeting] - Chair: Matt Shepherd<br />
** 11:50 (25) --- Report from the collaboration board - David Lawrence<br />
** 12:15 (15) --- Moving forward and closeout - Curtis Meyer<br />
* 12:30 Adjourn</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX-Collaboration-Oct-2015&diff=70641GlueX-Collaboration-Oct-20152015-10-08T17:38:11Z<p>Kmoriya: /* Thursday October 8, 2014 */</p>
<hr />
<div>== GlueX Collaboration Meeting ==<br />
<br />
<font size="+1">October 8 to 10, 2015 at Jefferson Lab</font><br />
<br />
The following template has been set up to allow people to identify what needs to be presented at the meeting. It has been roughly broken out by working group with suggestions for topics within each group. The working group chairs should work on adding the relevant talks with an estimated time (including questions) and the speaker's name. If there is a talk that does not fit into this template, please add it at the bottom.<br />
<br />
== Registration ==<br />
Everyone participating in the collaboration meeting, whether in person at JLab or remotely via electronic media, is encouraged to register. Please visit the<br />
[https://misportal.jlab.org/Ul/conferences/generic_conference/registration.cfm?conference_id=COLLAB-GLUEX-OCT2015 Registration Page].<br />
<br />
To see who else will be attending either in person or via electronic media, please see the<br />
[https://misportal.jlab.org/Ul/conferences/generic_conference/participants.cfm?conference_id=COLLAB-GLUEX-OCT2015 Current List of Participants Page]<br />
<br />
== Location ==<br />
CEBAF Center L102<br />
<br />
== Remote Access ==<br />
# To join via a Web Browser, go to the page [https://bluejeans.com/660743227] https://bluejeans.com/660743227.<br />
# To join via Polycom room system go to the IP Address: 199.48.152.152 ([http://bjn.vc bjn.vc]) and enter the meeting ID: 660743227.<br />
# To join via phone, use one of the following numbers and the Conference ID: 660743227.<br />
#* US or Canada: +1 408 740 7256 or <br />
#* US or Canada: +1 888 240 2560<br />
# More information on connecting to [[Connect to the Data Challenge Meetings|bluejeans]] is available.<br />
<br />
== Talks on the DocDB ==<br />
<br />
The DocDB talk-upload template for the October 2015 Collaboration Meeting can be found here:<br />
<br />
http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?conferenceid=46<br />
<br />
'''Upload your talks at the link above and then edit the agenda below to enter the appropriate link to your talk based on its DocDB document number'''. <br />
<br />
Example (from the May 2008 meeting):<br />
* 15:15 Session I (90) --- Offline Working Group Meeting --- Chair Curtis Meyer<br />
** (20) [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=1042 Offline Software Status] -- D. Lawrence<br />
<br />
-----<br />
<br />
== AGENDA == <br />
<br />
== Thursday October 8, 2015 ==<br />
* 8:30 Session Ia (110) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/DisplayMeeting?sessionid=499 Opening Session] - Chair: Matt Shepherd<br />
** 8:30 (10) --- Welcome --- Curtis Meyer<br />
** 8:40 (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2830 Hall-D Update] --- Eugene Chudakov<br />
** 9:10 (10) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2831 JLab IT] --- Amber Boehnlein<br />
** 9:20 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2829 Accelerator Update] --- Todd Satogata<br />
** 9:45 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2835 Engineering Update] --- Tim Whitlatch<br />
** 10:00 (30) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2826 Electronics Update] --- Fernando Barbosa<br />
* 10:30 (30) Coffee<br />
* 11:00 Session Ib (100) --- DAQ/Trigger/Electronics - (Organizer:David L. ) Chair: - Elton Smith <br />
** 11:00 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2828 FA125 Firmware] --- Naomi Jarvis<br />
** 11:20 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2836 DAQ Status] --- Sergey Furletov<br />
** 11:40 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2837 L1 Trigger Status] --- Alex Somov<br />
** 12:00 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2834 Control Status] --- Hovanes Egiyan<br />
** 12:20 (20) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2827 Online/L3 Status] --- David Lawrence<br />
* 12:40 (80) Lunch <br />
* 12:40 (80) Collaboration Board Meeting<br />
* 14:00 Session IIa (160) --- Beamline and Tagger - (Organizer: R. Jones ) - Chair: - Zisis Papandreou<br />
** 14:00 (25) --- Diamond radiator fabrication --- Brendan Pratt<br />
** 14:25 (25) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2839 Polarimeter Update] --- Kei Moriya<br />
** 14:50 (25) --- Tagger hodoscope calibration --- Nathan Sparks<br />
** 15:15 (25) --- Tagger microscope calibration --- Alex Barnes<br />
** 15:40 (25) --- Pair Spectrometer calibration --- Alex Somov<br />
* 16:05 (20) Coffee<br />
** 16:25 (25) --- [http://argus.phys.uregina.ca/gluex/DocDB/0028/002833/001/beamline_radmon_spring15.pdf Photon beam monitoring survey] --- Alexandre Deur<br />
** 16:50 (25) --- Fast feedback control system --- Trent Allison<br />
* 17:15 Session IIb (50) --- Run Planning for Fall 2015 - (Organizer: ) Chair: Kei Moriya<br />
** 17:15 (50) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2832 Discussion/All]<br />
** [[Run Coordination Meetings: Fall 2015 Run]]<br />
* 18:15 --- Reception in the Atrium<br />
<br />
== Friday October 9, 2015 ==<br />
* 9:00 Session IIIa (75) --- Calorimeters - (Organizer: Zisis Papandreou) - Chair: Beni Zihlmann<br />
** 9:00 (25) --- BCAL Status, Simulations and Analysis -- Zisis Papandreou<br />
** 9:25 (25) --- BCAL Calibration/Reconstruction -- Mark Dalton<br />
** 9:50 (25) --- FCAL Update -- Adesh Subedi<br />
* 10:15 (30) Coffee<br />
* 10:45 Session IIIb (100) --- Tracking - (Organizer: Naomi Jarvis) - Chair: Justin Stevens<br />
** 10:45 (25) --- CDC Update -- Mike Staib<br />
** 11:10 (15) --- Tracking Update -- Simon Taylor<br />
** 11:25 (10) --- [http://argus.phys.uregina.ca/cgi-bin/private/DocDB/ShowDocument?docid=2838 Tracking Efficiency and Resolution] -- Paul Mattione<br />
** 11:35 (25) --- FDC Update -- Lubomir Pentchev<br />
** 12:00 (25) --- Transition Radiation Detector -- Sergey Furletov<br />
* 12:25 Lunch (95)<br />
* 14:00 Session IVa (150) --- Offline/Analysis (Organizer: Mark Ito ) Chair: Dave Mack<br />
** 14:00 (25) --- Overview -- Mark Ito<br />
** 14:25 (25) --- Software Version Management System -- Mark Ito<br />
** 14:50 (25) --- Software/Analysis Tutorial: &gamma;p &rarr; p&omega; -- Paul Mattione<br />
** 15:15 (25) --- Offline Monitoring and SWIF -- Kei Moriya<br />
** 15:40 (25) --- Geant4 Development Update -- Richard Jones<br />
** 16:05 (25) --- Calibration Status -- Sean Dobbs<br />
* 16:30 Coffee (30)<br />
* 17:00 Session IVb (90) --- Particle ID - (Organizer: Justin Stevens ) - Chair: Sean Dobbs<br />
** 17:00 (15) --- Start Counter Efficiencies -- Mahmoud Kamel <br />
** 17:15 (15) --- Start Counter Calibration & Performance -- Eric Pooser <br />
** 17:30 (30) --- TOF Performance -- Brad Cannon<br />
** 18:00 (30) --- DIRC Update -- John Hardin<br />
<br />
== Saturday October 10, 2015 ==<br />
* 9:00 Session Va (120) --- Physics - (Organizer: Volker Crede) - Chair: David Lawrence<br />
** 9:00 (30) --- Update on GlueX analysis efforts -- Justin Stevens <br />
** 9:30 (30) --- Survey of multi-photon final states from the Spring data -- Simon Taylor<br />
** 10:00 (30) --- &omega; Photoproduction off Nuclei: Updates and plans for a PAC proposal -- Alexander Somov<br />
** 10:30 (30) --- Recent results on meson spectroscopy from CLAS -- Paul Eugenio<br />
* 11:00 (20) Coffee<br />
** 11:20 (15) --- Omega Decays -- Michael Staib<br />
** 11:35 (15) --- Eta Decays -- Will McGinley <br />
* 11:50 (30) Session Vb (90) --- Business Meeting - Chair: Matt Shepherd<br />
** 11:50 (25) --- Report from the collaboration board - David Lawrence<br />
** 12:15 (15) --- Moving forward and closeout - Curtis Meyer<br />
* 12:30 Adjourn</div>Kmoriyahttps://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_September_30,_2015&diff=70471GlueX Offline Meeting, September 30, 20152015-09-30T17:41:03Z<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, September 30, 2015<br><br />
1:30 pm EDT<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## [https://github.com/orgs/JeffersonLab/teams?utf8=%E2%9C%93&query=%40markito3 Team Maintainers and Admins] (Mark)<br />
## [[GlueX-Collaboration-Oct-2015|Collaboration Meeting]] October 8-10, 2015 at Jefferson Lab<br />
# Review of [[GlueX Offline Meeting, September 16, 2015#Minutes|minutes from September 16]] (all)<br />
# Geant4 Update (Richard, David)<br />
# [https://halldweb1.jlab.org/wiki/images/e/e5/2015-09-30-offline_monitoring.pdf Offline Monitoring (Kei)] [https://halldweb.jlab.org/data_monitoring/launch_analysis/tmp/summary_swif_output_offline_monitoring_RunPeriod2015_03_ver15_hd_rawdata.html SWIF output]<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Spring 2015 Commissioning Simulations]]<br />
# Fall 2015 Commissioning Simulations (all)<br />
# Auto-Build on Pull Request (Sean)<br />
# [https://halldweb.jlab.org/wiki/images/b/b9/Sdobbs_OfflineMtg_20150930.pdf Noise Studies] (Sean)<br />
# [[Automatic Tests of GlueX Software|b1pi results review]]<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
#* comments on merge<br />
#* alternate workflows for submitting pull requests<br />
#* rebasing?<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007.<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/.
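For example, a talk can be copied into place from a JLab CUE machine such as ifarm (a minimal sketch; the PDF filename below is hypothetical):
<pre>
# On a JLab CUE machine (e.g. ifarm); the filename is only an example
cp 2015-09-30-mytalk.pdf /group/halld/www/halldweb/html/talks/2015/
# The talk is then visible at https://halldweb.jlab.org/talks/2015/2015-09-30-mytalk.pdf
</pre>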
</div>
Kmoriya
https://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_September_30,_2015&diff=70470
GlueX Offline Meeting, September 30, 2015 (2015-09-30T17:32:52Z)
<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, September 30, 2015<br><br />
1:30 pm EDT<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## [https://github.com/orgs/JeffersonLab/teams?utf8=%E2%9C%93&query=%40markito3 Team Maintainers and Admins] (Mark)<br />
## [[GlueX-Collaboration-Oct-2015|Collaboration Meeting]] October 8-10, 2015 at Jefferson Lab<br />
# Review of [[GlueX Offline Meeting, September 16, 2015#Minutes|minutes from September 16]] (all)<br />
# Geant4 Update (Richard, David)<br />
# [https://halldweb1.jlab.org/wiki/images/e/e5/2015-09-30-offline_monitoring.pdf Offline Monitoring (Kei)]<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Spring 2015 Commissioning Simulations]]<br />
# Fall 2015 Commissioning Simulations (all)<br />
# Auto-Build on Pull Request (Sean)<br />
# [https://halldweb.jlab.org/wiki/images/b/b9/Sdobbs_OfflineMtg_20150930.pdf Noise Studies] (Sean)<br />
# [[Automatic Tests of GlueX Software|b1pi results review]]<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
#* comments on merge<br />
#* alternate workflows for submitting pull requests<br />
#* rebasing?<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007.<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/.</div>
Kmoriya
https://halldweb.jlab.org/wiki/index.php?title=File:2015-09-30-offline_monitoring.pdf&diff=70469
File:2015-09-30-offline monitoring.pdf (2015-09-30T17:32:23Z)
<p>Kmoriya: Offline monitoring report by Kei for offline meeting 2015/09/30</p>
<hr />
<div>Offline monitoring report by Kei for offline meeting 2015/09/30</div>
Kmoriya
https://halldweb.jlab.org/wiki/index.php?title=GlueX_Offline_Meeting,_September_16,_2015&diff=70103
GlueX Offline Meeting, September 16, 2015 (2015-09-16T17:26:36Z)
<p>Kmoriya: /* Agenda */</p>
<hr />
<div>GlueX Offline Software Meeting<br><br />
Wednesday, September 16, 2015<br><br />
1:30 pm EDT<br><br />
JLab: CEBAF Center F326/327<br />
<br />
==Agenda==<br />
<br />
# Announcements<br />
## [https://mailman.jlab.org/pipermail/halld-offline/2015-September/002144.html "git_update" simple email list] (Mark)<br />
## [[Scripts for Installing GlueX Software|gluex_install]] moved to GitHub (Mark)<br />
## [https://mailman.jlab.org/pipermail/halld-offline/2015-September/002143.html ROOT TTree Format Overhaul] (Paul)<br />
## [https://halldweb.jlab.org/wiki/images/1/15/2015-09-16-offline_monitoring.pdf Offline Monitoring] (Kei)<br />
# Review of [[GlueX Offline Meeting, September 2, 2015#Minutes|minutes from September 2]] (all)<br />
# [[GlueX-Collaboration-Oct-2015|Collaboration Meeting]] October 8-10, 2015 at Jefferson Lab<br />
# Geant4 Update (Richard, David)<br />
# [[Data Challenge 3]] update (Mark)<br />
# [[Spring 2015 Commissioning Simulations]]<br />
# Fall 2015 Commissioning Simulations (all)<br />
# Auto-Build on Pull Request (Sean)<br />
# Review of [https://github.com/JeffersonLab/sim-recon/pulls?q=is%3Aopen+is%3Apr recent pull requests]<br />
# Action Item Review<br />
<br />
==Communication Information==<br />
<br />
===Remote Connection===<br />
<br />
* The BlueJeans meeting number is 968 592 007.<br />
* [http://bluejeans.com/968592007 Join the Meeting] via BlueJeans<br />
<br />
===Slides===<br />
<br />
Talks can be deposited in the directory <code>/group/halld/www/halldweb/html/talks/2015</code> on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/.</div>
Kmoriya
https://halldweb.jlab.org/wiki/index.php?title=File:2015-09-16-offline_monitoring.pdf&diff=70102
File:2015-09-16-offline monitoring.pdf (2015-09-16T17:26:09Z)
<p>Kmoriya: Talk about offline monitoring 2015-03 ver14 given at offline meeting September 16 2015 by Kei</p>
<hr />
<div>Talk about offline monitoring 2015-03 ver14 given at offline meeting September 16 2015 by Kei</div>
Kmoriya