Offline Monitoring Post Processing


Overview

To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on the monitoring web pages. The results from the different raw data files in a run are also combined into a single ROOT file per run, and other bookkeeping tasks are performed. This section describes how to generate the monitoring images and database information.

The post-processing scripts generally perform the following steps for each run:

  1. Summarize monitoring information from each EVIO file, store this information in a database
  2. Merge the monitoring ROOT files into a single file for the run
  3. Generate summary monitoring information for the run and store it in a database
  4. Generate summary monitoring plots and store these in a web-accessible location

The scripts used to generate this summary data are primarily run from /home/gxprojN/monitoring/process, i.e., from the same account from which the monitoring launch was performed. If you want a new copy of the scripts, e.g., for a new monitoring run, check them out from SVN:

svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process

Note that these scripts depend on standard GlueX environment definitions to load the python modules needed to access MySQL databases and to process ROOT files.

Online Monitoring

When a DAQ run ends, the online monitoring system pushes two pieces of data to the lustre file system: ROOT files containing histograms from the online monitoring system and a text file containing some run condition information.

This data is processed by a cron job, run under the "gluex" account, which executes the following script:

/home/gluex/halld/monitoring/process/check_new_runs.csh

This script runs the check_new_runs.py program, which generates the summary information. The python script automatically checks for new ROOT files and processes them as they appear. It contains several configuration variables that must be set correctly, specifying the locations of the input and output directories, etc. The run meta-data processing is deprecated in favor of information from the RCDB.
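
As an illustration (the variable names and values here are hypothetical, not the actual contents of check_new_runs.py), the configuration block looks roughly like this:

# Illustrative configuration block; the real names and paths in check_new_runs.py may differ.
INPUT_DIRECTORY    = "/path/to/online_monitoring_rootfiles"   # where the online ROOT files land
OUTPUT_DIRECTORY   = "/work/halld2/data_monitoring"           # web-accessible output area
MIN_RUN, MAX_RUN   = 30000, 39999                             # run range to scan
PROCESSED_RUN_LIST = "processedrun.lst.online"                # bookkeeping file for finished runs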

Starting a new run period

  1. First, create a new data version.
    • Note: need to update how this data version is set
  2. Consider updating the run range which is being scanned
    • This range should also be updated when data taking actually begins
    • The list of runs that have already been processed is stored in the file processedrun.lst.online. Processing is keyed off the existence of a new ROOT file from the online monitoring system, but before data taking the monitoring system is not always run. To handle delays in copying the ROOT file from the online side, a run is not marked as processed while its ROOT file is missing, so the run range being scanned has to be managed by hand (the bookkeeping is sketched below).
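
A minimal sketch of that bookkeeping, with a hypothetical ROOT file naming convention, is:

import os

# Sketch only: the real logic lives in check_new_runs.py, and the ROOT file
# naming convention used here is a placeholder.
def runs_to_process(run_range, rootfile_dir, processed_list="processedrun.lst.online"):
    processed = set()
    if os.path.exists(processed_list):
        with open(processed_list) as f:
            processed = {int(line) for line in f if line.strip()}
    pending = []
    for run in run_range:
        if run in processed:
            continue
        rootfile = os.path.join(rootfile_dir, "hdmon_online_%06d.root" % run)
        if os.path.exists(rootfile):   # only process a run once its ROOT file has arrived
            pending.append(run)
    return pending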

Offline Monitoring

Instructions

This section gives instructions for post-processing the different types of monitoring data. Generally, each process is driven by one program with several configuration parameters, stored near the top of the file, that need to be set. The post-processing is done on the batch farm, except when processing incoming data.

The configuration options for each type of data are set in the scripts described in the corresponding sections below. For generating monitoring plots, there are two additional files; one plot is generated for each entry in these files (illustrative entries are shown after the list):

  • histograms_to_monitor - Histogram name or full path in ROOT file
  • macros_to_monitor - Full pathnames for ROOT macros to execute
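
For example (the histogram and macro names here are purely illustrative), histograms_to_monitor might contain lines such as

cdc_occupancy
/FDC/fdc_num_events

while macros_to_monitor would contain full macro pathnames such as

/home/gxprojN/monitoring/process/macros/fcal_occupancy.C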

Incoming Monitoring Data

The monitoring jobs that are run over incoming data are post-processed using the check_monitoring_data.csh script via a cron job run as the gxproj5 user on ifarm1401. This script should only be used to process ver01 monitoring data.

Offline Monitoring Launch

The post-processing for a monitoring launch involves merging histograms, creating plots for display on the web, and putting summary information into a database. The processing for each run is performed by check_monitoring_data.batch.sh. The directory structure and options used can be changed by modifying this file.

The jobs are submitted using the script submit_batch.py. This is a general driver program for submitting post-processing jobs. There are several variables that should be set (an illustrative block is shown after this list):

  • Workflow name, e.g., "offmon_2016-02_ver08_post"
  • Data type, e.g., "mon"
  • Data version, e.g., "08"
  • Run period, e.g., "RunPeriod-2016-02"
  • Post-processing command, e.g., "check_recon_data.batch.csh"
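
Inside submit_batch.py these are ordinary Python variables near the top of the file; an illustrative block (the exact variable names may differ) is:

# Illustrative settings; the actual variable names in submit_batch.py may differ.
WORKFLOW   = "offmon_2016-02_ver08_post"    # batch workflow name
DATA_TYPE  = "mon"                          # type of data being post-processed
VERSION    = "08"                           # data version
RUN_PERIOD = "RunPeriod-2016-02"
COMMAND    = "check_recon_data.batch.csh"   # post-processing command run in each job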

The log files are currently stored in /volatile/halld/home/gxproj5/process/batch_log

Note that the jobs are multi-threaded because the histogram merging is multi-threaded; a two-stage merge is performed to obtain better performance when merging what can be more than 100 files per run.

Reconstruction Launch

The post-processing for a reconstruction launch involves merging the various types of output so that there is one file of each type per run. The outputs that are currently merged are the ROOT files containing (1) monitoring histograms and (2) ROOT trees. EVIO files are currently not merged. The processing for each run is performed by check_recon_data.batch.sh. The directory structure and the lists of files to be merged can be modified by editing this file.

The batch jobs are submitted using the submit_batch.py script as described above.

Analysis Launch

Currently the only post-processing needed for the analysis launch is to merge the ROOT files that contain histograms into one file per run.

The script that does this job is merge_analysis_hists.py. Edit this script to have the correct run period and version, and then run it.

Simulation Launch

This part of the system is out-of-date and will be updated after the next simulation launch.

Details

The post-processing is driven by the process_new_offline_data.py script. The processing of the different types of data shares many features: merging files on a run-by-run basis, traversing similar directory structures, etc. The script takes several options to enable the various types of processing that may be needed in each case.

Several subsystems are used to perform major processing steps:

  • make_monitoring_plots.py - Generates plots in PNG format for web display
  • phadd.py - Two-stage, multi-threaded ROOT file merging (the idea is sketched after this list)
  • summarize_monitoring_data.py - Collects statistics and makes fits to histograms to find occupancies, resolutions, etc.
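
The two-stage merge in phadd.py can be pictured with the following simplified sketch (not the actual phadd.py code; it assumes ROOT's hadd command is available in the environment):

import os
import subprocess
import tempfile
from multiprocessing import Pool

def hadd(job):
    """Sum a list of ROOT files into one output file using ROOT's hadd."""
    output, inputs = job
    subprocess.check_call(["hadd", "-f", output] + inputs)
    return output

def two_stage_merge(final_output, input_files, nthreads=4, chunk_size=20):
    # Stage 1: merge the inputs in chunks, several chunks running in parallel.
    tmpdir = tempfile.mkdtemp()
    chunks = [input_files[i:i + chunk_size] for i in range(0, len(input_files), chunk_size)]
    jobs = [(os.path.join(tmpdir, "partial_%03d.root" % n), chunk)
            for n, chunk in enumerate(chunks)]
    pool = Pool(nthreads)
    partials = pool.map(hadd, jobs)
    pool.close()
    # Stage 2: merge the partial sums into the final per-run file.
    hadd((final_output, partials))

Splitting the work this way keeps all threads busy during the first pass and leaves only a handful of partial files for the final, serial merge.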

Directory Structure

The post-processing assumes the common directory structure used by the offline batch job scripts. Some comments follow (a short path-construction example is given after the list):

  • Input directories
    • Most of the launch output files (the inputs to the post-processing) are stored on the /cache disk under a directory path like /cache/halld/$RUNPERIOD/$DATATYPE.
    • Smaller files, such as log files, can be stored in a different location (e.g. /work/halld2), and can be optionally tar'd and stored on the /cache disk for more permanent storage
  • Output directories
    • Monitoring outputs are placed in two web-accessible directories.
    • The location for most files (e.g. monitoring plots) is /work/halld2/data_monitoring/
    • The /work/halld2 disk is limited in size, so the merged ROOT files are put in /work/halld/data_monitoring/
    • We try to limit the number of web-accessible files stored on lustre disks due to their instability - one lustre disk timing out can make the whole webserver freeze.
    • Merged analysis ROOT files are currently stored in /cache/halld/$RUNPERIOD/analysis/$VERSION/hists
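
Putting these conventions together, the paths used by the post-processing follow a pattern like the one below (the values are illustrative; the real ones are set in the scripts' configuration variables):

import os

RUN_PERIOD = "RunPeriod-2016-02"   # illustrative values
DATA_TYPE  = "mon"
VERSION    = "ver08"

input_dir       = os.path.join("/cache/halld", RUN_PERIOD, DATA_TYPE)    # launch outputs read by post-processing
plot_dir        = "/work/halld2/data_monitoring"                         # web plots and other small files
merged_root_dir = "/work/halld/data_monitoring"                          # merged monitoring ROOT files
analysis_hists  = os.path.join("/cache/halld", RUN_PERIOD, "analysis", VERSION, "hists")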

Options

Details on the options taken by process_new_offline_data.py are given below in a more detailed version of the online help; an example invocation follows the option list. [still being updated]

ifarm1401> python process_new_offline_data.py 
Usage: process_new_offline_data.py input_directory output_directory

Options:
  -h, --help            show this help message and exit
  -p, --disable_plots   Don't make PNG files for web display
  -d, --disable_summary
                        Don't calculate summary information and store it in the DB
  -s, --disable_hadd    Don't sum output histograms into one combined file.
  -f, --force           Ignore list of already processed runs
  -R RUN_NUMBER, --run_number=RUN_NUMBER
                        Process only this particular run number
  -V VERSION_NUMBER, --version_number=VERSION_NUMBER
                        Save summary results with this DB version ID
  -v VERSION_STRING, --version=VERSION_STRING
                        Save summary results with a particular data version, specified using the string "RunPeriod,Revision", e.g., "RunPeriod-2014-10,5"
  -b MIN_RUN, --min_run=MIN_RUN
                        Minimum run number to process
  -e MAX_RUN, --max_run=MAX_RUN
                        Maximum run number to process
  -L LOGFILE, --logfile=LOGFILE
                        Base file name to save logs to
  -t NTHREADS, --nthreads=NTHREADS
                        Number of threads to use
  -A, --parallel        Enable parallel processing.
  -S, --save_rest       Save REST files to conventional location.
  -M, --merge-incrementally
                        Merge ROOT files incrementally and delete old ones.
  -E, --no-end-of-job-processing
                        Disable end of run processing.
  --merge-trees=ROOT_TREES_TO_MERGE
                        Merge these ROOT trees.
  --merge-skims=EVIO_SKIMS_TO_MERGE
                        Merge these EVIO skims.
  -T ROOT_OUTPUT_DIR, --merged-root-output-dir=ROOT_OUTPUT_DIR
                        Directory to save merged ROOT files
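
As a usage example (the run number and directories are illustrative), a single run from the 2016-02 offline monitoring launch, version 8, could be reprocessed with something like:

ifarm1401> python process_new_offline_data.py -v RunPeriod-2016-02,8 -R 11366 -t 4 \
               -T /work/halld/data_monitoring/RunPeriod-2016-02 \
               /cache/halld/RunPeriod-2016-02/mon /work/halld2/data_monitoring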

Data Versions

To document the conditions under which the monitoring data was created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.

We store one record per pass through one run period, with the following structure:

Field               Description
data_type           The level of data being processed. For the purposes of monitoring, "rawdata" is the online monitoring and "recon" is the offline monitoring
run_period          The run period of the data
revision            An integer specifying which pass through the run period this data corresponds to
software_version    The name of the XML file that specifies the different software versions used
jana_config         The name of the text file that specifies which JANA options were passed to the reconstruction program
ccdb_context        The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
production_time     The date at which monitoring/reconstruction began
dataVersionString   A convenient string for identifying this version of the data


An example file used as input to ./register_new_version.py is:

data_type           = recon
run_period          = RunPeriod-2014-10
revision            = 1
software_version    = soft_comm_2014_11_06.xml
jana_config         = jana_rawdata_comm_2014_11_06.conf
ccdb_context        = calibtime=2014-11-10
production_time     = 2014-11-10
dataVersionString   = recon_RunPeriod-2014-10_20141110_ver01