Offline Monitoring Post Processing
To visualize the monitoring data, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on the monitoring web pages. The results from the different raw data files in a run are also combined into a single ROOT file per run, and other bookkeeping tasks are performed. This section describes how to generate the monitoring images and database information.
The post-processing scripts generally perform the following steps for each run:
- Summarize monitoring information from each EVIO file, store this information in a database
- Merge the monitoring ROOT files into a single file for the run
- Generate summary monitoring information for the run and store it in a database
- Generate summary monitoring plots and store these in a web-accessible location
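The steps above can be sketched as a single per-run driver. This is an illustrative outline only, not the actual script: the step callables stand in for the real database and ROOT operations.

```python
# Illustrative sketch of the per-run post-processing flow; the step
# callables stand in for the real database and ROOT operations.

def process_run(run_number, evio_summaries, merge, summarize, make_plots):
    """Execute the four post-processing steps for one run and return
    a record of what was done, in order."""
    log = []
    # 1. Summarize monitoring information from each EVIO file.
    for fname in evio_summaries:
        log.append(("summarize_file", fname))
    # 2. Merge the monitoring ROOT files into a single file for the run.
    merge(run_number)
    log.append(("merge", run_number))
    # 3. Generate run-level summary information and store it in the DB.
    summarize(run_number)
    log.append(("summary", run_number))
    # 4. Generate summary plots for the web-accessible area.
    make_plots(run_number)
    log.append(("plots", run_number))
    return log
```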
The scripts used to generate this summary data are primarily run from /home/gxprojN/monitoring/process, i.e., the same account from which the monitoring launch was performed. If you want a new copy of the scripts, e.g., for a new monitoring run, you should check the scripts out from SVN:
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process
Note that these scripts depend on standard GlueX environment definitions to load the python modules needed to access MySQL databases and to process ROOT files.
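A quick way to verify that the environment provides the needed modules is to attempt the imports up front. The module names below (MySQLdb, ROOT) are the usual ones for MySQL and ROOT access, but the exact set depends on the GlueX environment setup; this helper is illustrative.

```python
import importlib

def missing_modules(required=("MySQLdb", "ROOT")):
    """Return the subset of required modules that cannot be imported,
    so a script can fail early with a clear message."""
    missing = []
    for name in required:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```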
When a DAQ run ends, the online monitoring system pushes two pieces of data to the lustre file system: ROOT files containing histograms from the online monitoring system and a text file containing some run condition information.
This data is processed by a cron job run under the "gluex" account, which runs the check_new_runs.py script. This Python script automatically checks for new ROOT files and processes them, generating summary information. It contains several configuration variables that must be set correctly, specifying the locations of the input/output directories, etc. The run metadata processing is deprecated in favor of information from the RCDB.
Starting a new run period
- First, create a new data version.
- Note: need to update how this data version is set
- Consider updating the run range which is being scanned
- This range should also be updated when data taking actually begins
- The list of runs that have already been processed is stored in the file processedrun.lst.online. Processing is keyed off the existence of a new ROOT file from the online monitoring system; however, before data taking, the monitoring system is not always run. To handle cases in which there are delays in copying the ROOT file from the online side, runs are not marked as processed when the ROOT file does not yet exist. So, the run range needs to be managed by hand.
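The run-scanning logic can be sketched as follows. The file-name pattern (a six-digit run number embedded in the ROOT file name) is an assumption for illustration; check_new_runs.py is the authoritative implementation.

```python
import os
import re

def find_new_runs(rootfile_dir, processed_list_path, min_run, max_run):
    """Return sorted run numbers in [min_run, max_run] that have a
    monitoring ROOT file but are not yet in the processed-runs list."""
    processed = set()
    if os.path.exists(processed_list_path):
        with open(processed_list_path) as f:
            processed = set(int(line) for line in f if line.strip())
    new_runs = set()
    for fname in os.listdir(rootfile_dir):
        match = re.search(r"(\d{6})", fname)  # assumed 6-digit run number
        if match is None:
            continue
        run = int(match.group(1))
        if min_run <= run <= max_run and run not in processed:
            new_runs.add(run)
    return sorted(new_runs)
```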
This section gives instructions for post-processing the different types of monitoring data. Generally, each process is driven by one program with several configuration parameters, stored near the top of the file, which need to be set. The post-processing is done on the batch farm, except when processing incoming data.
The configuration options for each type of data are generally set in the scripts described in each section. For generating monitoring plots, there are two additional files; one plot is generated for each entry in these files:
- histograms_to_monitor - Histogram name or full path in ROOT file
- macros_to_monitor - Full pathnames for ROOT macros to execute
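Both files follow a simple one-entry-per-line format. A minimal reader might look like this; treating '#' lines as comments is an assumed convention, not documented behavior.

```python
def read_monitor_list(path):
    """Read histograms_to_monitor or macros_to_monitor: one entry per
    line, skipping blank lines and '#' comments (comment handling is
    an assumed convention)."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                entries.append(line)
    return entries
```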
Incoming Monitoring Data
The monitoring jobs that are run over incoming data are post-processed using the check_monitoring_data.csh script via a cron job run as the gxproj5 user on ifarm1401. This script should only be used to process ver01 monitoring data.
Offline Monitoring Launch
The post-processing for a monitoring launch involves merging histograms, creating plots for display on the web, and putting summary information into a database. The processing for each run is performed by check_monitoring_data.batch.sh. The directory structure and options used can be changed by modifying this file.
The jobs are submitted using the script submit_batch.py. This is a general driver program for submitting post-processing jobs. There are several variables that should be set:
- Workflow name, e.g., "offmon_2016-02_ver08_post"
- Data type, e.g., "mon"
- Data version, e.g., "08"
- Run period, e.g., "RunPeriod-2016-02"
- Post-processing command, e.g., "check_recon_data.batch.csh"
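The workflow name in the example follows a simple pattern that can be built from the other variables. The helper below is hypothetical, and the naming convention is inferred from the example values above.

```python
def workflow_name(prefix, run_period, version):
    """Build a post-processing workflow name such as
    'offmon_2016-02_ver08_post' from its pieces (naming convention
    inferred from the example values; illustrative only)."""
    period = run_period.replace("RunPeriod-", "")
    return "%s_%s_ver%s_post" % (prefix, period, version)
```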
The log files are currently stored in /volatile/halld/home/gxproj5/process/batch_log.
Note that the jobs are multi-threaded because the histogram merging is multi-threaded; a two-stage merge is performed to obtain better performance when merging what can be over 100 files.
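The idea behind the two-stage merge is to split a long file list into chunks that can be merged in parallel, then merge the intermediate outputs. A sketch of the chunking (the real implementation is phadd.py; the intermediate file names here are illustrative):

```python
def merge_plan(files, chunk_size=20):
    """Plan a two-stage merge: stage one merges chunks of the input in
    parallel, stage two merges the intermediate results. Returns
    (stage1_chunks, stage2_inputs)."""
    stage1 = [files[i:i + chunk_size]
              for i in range(0, len(files), chunk_size)]
    stage2 = ["intermediate_%03d.root" % i for i in range(len(stage1))]
    return stage1, stage2
```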
The post-processing for a reconstruction launch involves merging the various types of output so that each has one file per run. The outputs currently merged are ROOT files containing (1) monitoring histograms and (2) ROOT trees. EVIO files are currently not merged. The processing for each run is performed by check_recon_data.batch.sh; the directory structure and the lists of files to be merged can be modified by editing this file.
The batch jobs are submitted using the submit_batch.py script as described above.
Currently the only post-processing needed for the analysis launch is to merge the ROOT files that contain histograms into one file per run.
The script that does this job is merge_analysis_hists.py. Edit this script to have the correct run period and version, and then run it.
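The merge amounts to grouping the per-file histogram ROOT files by run and issuing one hadd per run. A sketch of that grouping follows; the file-name pattern and merged-output name are assumptions for illustration, not the actual conventions of merge_analysis_hists.py.

```python
import collections
import os
import re

def hadd_commands(input_files, output_dir):
    """Group input ROOT files by run number and build one hadd command
    per run. The 6-digit run number embedded in the file name is an
    assumed convention."""
    by_run = collections.defaultdict(list)
    for path in input_files:
        match = re.search(r"(\d{6})", os.path.basename(path))
        if match:
            by_run[int(match.group(1))].append(path)
    commands = []
    for run in sorted(by_run):
        merged = os.path.join(output_dir, "hd_root_analysis_%06d.root" % run)
        commands.append(["hadd", merged] + sorted(by_run[run]))
    return commands
```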
This part of the system is out-of-date and will be updated after the next simulation launch.
The post-processing is driven by the process_new_offline_data.py script. The processing of the different types of data shares many features: merging files on a run-by-run basis, traversing similar directory structures, etc. This script takes several options that enable the various types of processing that may need to be done in each case.
Several subsystems are used to perform major processing steps:
- make_monitoring_plots.py - Generates plots in PNG format for web display
- phadd.py - 2 stage multithreaded ROOT file merging
- summarize_monitoring_data.py - Collects statistics and makes fits to histograms to find occupancies, resolutions, etc.
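The kind of summary quantity extracted by summarize_monitoring_data.py can be illustrated with a simplified binned mean/RMS calculation. This is a stand-in only: the real script fits histograms with ROOT, which is not shown here.

```python
import math

def hist_stats(bin_centers, contents):
    """Entries, mean, and RMS of a binned distribution - a simplified
    stand-in for the ROOT-based fits in summarize_monitoring_data.py."""
    total = float(sum(contents))
    if total == 0:
        return None
    mean = sum(c * w for c, w in zip(bin_centers, contents)) / total
    variance = sum(w * (c - mean) ** 2
                   for c, w in zip(bin_centers, contents)) / total
    return {"entries": total, "mean": mean, "rms": math.sqrt(variance)}
```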
The common structure for the offline batch job scripts is assumed in the post-processing. Some comments follow:
- Input directories
- Most output files are stored on the /cache disk under a directory path like /cache/halld/$RUNPERIOD/$DATATYPE.
- Smaller files, such as log files, can be stored in a different location (e.g. /work/halld2), and can be optionally tar'd and stored on the /cache disk for more permanent storage
- Output directories
- Monitoring outputs are placed in two web-accessible directories.
- The location for most files (e.g. monitoring plots) is /work/halld2/data_monitoring/
- The /work/halld2 disk is limited in size, so the merged ROOT files are put in /work/halld/data_monitoring/
- We try to limit the number of web-accessible files stored on lustre disks due to their instability - one lustre disk timing out can make the whole webserver freeze.
- Merged analysis ROOT files are currently stored in /cache/halld/$RUNPERIOD/analysis/$VERSION/hists
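The directory conventions above can be captured in small helpers; these simply join path components and are illustrative, not part of the actual scripts.

```python
import os

def cache_dir(run_period, data_type):
    """Path on /cache where most output files for a launch live,
    following the /cache/halld/$RUNPERIOD/$DATATYPE convention."""
    return os.path.join("/cache/halld", run_period, data_type)

def merged_analysis_hists_dir(run_period, version):
    """Path on /cache where merged analysis ROOT files are stored."""
    return os.path.join("/cache/halld", run_period, "analysis", version, "hists")
```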
Details on the options taken by process_new_offline_data.py are given below in a more-detailed version of the online help. [still being updated]
ifarm1401> python process_new_offline_data.py
Usage: process_new_offline_data.py input_directory output_directory

Options:
  -h, --help            show this help message and exit
  -p, --disable_plots   Don't make PNG files for web display
  -d, --disable_summary
                        Don't calculate summary information and store it in the DB
  -s, --disable_hadd    Don't sum output histograms into one combined file.
  -f, --force           Ignore list of already processed runs
  -R RUN_NUMBER, --run_number=RUN_NUMBER
                        Process only this particular run number
  -V VERSION_NUMBER, --version_number=VERSION_NUMBER
                        Save summary results with this DB version ID
  -v VERSION_STRING, --version=VERSION_STRING
                        Save summary results with a particular data version,
                        specified using the string "RunPeriod,Revision",
                        e.g., "RunPeriod-2014-10,5"
  -b MIN_RUN, --min_run=MIN_RUN
                        Minimum run number to process
  -e MAX_RUN, --max_run=MAX_RUN
                        Maximum run number to process
  -L LOGFILE, --logfile=LOGFILE
                        Base file name to save logs to
  -t NTHREADS, --nthreads=NTHREADS
                        Number of threads to use
  -A, --parallel        Enable parallel processing.
  -S, --save_rest       Save REST files to conventional location.
  -M, --merge-incrementally
                        Merge ROOT files incrementally and delete old ones.
  -E, --no-end-of-job-processing
                        Disable end of run processing.
  --merge-trees=ROOT_TREES_TO_MERGE
                        Merge these ROOT trees.
  --merge-skims=EVIO_SKIMS_TO_MERGE
                        Merge these EVIO skims.
  -T ROOT_OUTPUT_DIR, --merged-root-output-dir=ROOT_OUTPUT_DIR
                        Directory to save merged ROOT files
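A cut-down sketch of how these options map onto Python's optparse (only a few of the flags are reproduced; the full set lives in the script itself):

```python
from optparse import OptionParser

def build_parser():
    """Reduced option parser mirroring a few of the
    process_new_offline_data.py options listed above."""
    parser = OptionParser(
        usage="Usage: %prog input_directory output_directory")
    parser.add_option("-p", "--disable_plots", action="store_true",
                      default=False, dest="disable_plots",
                      help="Don't make PNG files for web display")
    parser.add_option("-f", "--force", action="store_true",
                      default=False, dest="force",
                      help="Ignore list of already processed runs")
    parser.add_option("-R", "--run_number", dest="run_number",
                      help="Process only this particular run number")
    parser.add_option("-b", "--min_run", dest="min_run", type="int",
                      help="Minimum run number to process")
    parser.add_option("-e", "--max_run", dest="max_run", type="int",
                      help="Maximum run number to process")
    parser.add_option("-t", "--nthreads", dest="nthreads", type="int",
                      default=1, help="Number of threads to use")
    return parser
```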
To document the conditions of the monitoring data that is created, for the sake of reproducibility and further analysis, we save several pieces of information. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.
We store one record per pass through one run period, with the following structure:
- data_type - The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
- run_period - The run period of the data
- revision - An integer specifying which pass through the run period this data corresponds to
- software_version - The name of the XML file that specifies the different software versions used
- jana_config - The name of the text file that specifies which JANA options were passed to the reconstruction program
- ccdb_context - The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
- production_time - The date at which monitoring/reconstruction began
- dataVersionString - A convenient string for identifying this version of the data
An example file used as input to ./register_new_version.py is:
data_type = recon
run_period = RunPeriod-2014-10
revision = 1
software_version = soft_comm_2014_11_06.xml
jana_config = jana_rawdata_comm_2014_11_06.conf
ccdb_context = calibtime=2014-11-10
production_time = 2014-11-10
dataVersionString = recon_RunPeriod-2014-10_20141110_ver01
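A file in this format can be read into a dictionary with a few lines. The helper below is hypothetical, sketching what a script like register_new_version.py needs to do before inserting the record into the database.

```python
def parse_version_file(path):
    """Parse a 'key = value' data-version file into a dict. Only the
    first '=' splits the line, so values such as
    'calibtime=2014-11-10' survive intact."""
    info = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, _, value = line.partition("=")
            info[key.strip()] = value.strip()
    return info
```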