Difference between revisions of "Data Monitoring Procedures"

Revision as of 12:40, 7 February 2015

Saving Online Monitoring Data

The procedure for writing the data out is given in, e.g., Raid-to-Silo Transfer Strategy.

Once the DAQ writes out the data to the raid disk, cron jobs will copy the file to tape, and within ~20 min., we will have access to the file on tape at /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX.

All online monitoring plugins will be run as data is taken. They will be accessible within the counting house via RootSpy, and for each run and file, a ROOT file containing the histograms will be saved within a subdirectory for each run.

For immediate access to these files, the raid disk files may be accessed directly from the counting house, or the tape files will be available within ~20 min. of the file being written out.
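As an illustration of the tape location quoted above, here is a minimal sketch that assembles the directory for a given run. The helper name tape_run_dir is hypothetical, the six-digit zero padding is assumed from the RunXXXXXX placeholder, and the run period string is assumed to look like "RunPeriod-2014-10".

```python
def tape_run_dir(run_period: str, run: int) -> str:
    """Return the tape directory for one run (hypothetical helper)."""
    if not 0 <= run <= 999999:
        raise ValueError("run number must fit in six digits")
    # Assumed layout: /mss/halld/$RUN_PERIOD/rawdata/RunXXXXXX
    return f"/mss/halld/{run_period}/rawdata/Run{run:06d}"


print(tape_run_dir("RunPeriod-2014-10", 1506))
```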

Launching and Tracking Monitoring Jobs (NEW)

Overview

Database Overview

Script Overview

Log into the ifarm machine with one of the gxproj accounts

ssh gxproj1@ifarm -Y

Go to the project scripts folder and add the perl script directory to the current $PATH environment variable:

cd ~/halld/jproj/scripts/
source setup.csh

Come up with a name for your job submission project. It will be a unique identifier for the current set of job submissions. For example, for the 10th pass over the 10/2014 data for the offline monitoring:

offmon_rp2014m10_v10

However, the output file name format changed during the 10/2014 commissioning run (hd_raw_* --> hd_rawdata_*). Since these scripts assume a fixed file name format, for these runs an additional identifier should be used, e.g.:

offmon_rp2014m10_v10_type1, offmon_rp2014m10_v10_type2
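The naming convention can be sketched as a small helper. This is only an illustration: the function name project_name is hypothetical, and the pattern (rp<year>m<month>_v<pass>, with an optional _type<N> suffix) is inferred from the examples above rather than taken from the jproj scripts.

```python
def project_name(year, month, version, file_type=None):
    """Build a project name such as offmon_rp2014m10_v10_type1."""
    name = f"offmon_rp{year:04d}m{month:02d}_v{version}"
    if file_type is not None:
        # Extra identifier for runs whose input file name format differs
        name += f"_type{file_type}"
    return name


print(project_name(2014, 10, 10))               # 10th pass over the 10/2014 data
print(project_name(2014, 10, 10, file_type=2))  # same pass, second file name format
```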

Copy and rename an existing set of project files to create new project files for your project(s). For example:

cd ~/halld/jproj/projects/
cp -r offmon_rp2014m10_v10_type1 offmon_rp2014m10_v11_type1
cp -r offmon_rp2014m10_v10_type2 offmon_rp2014m10_v11_type2

An overview of each project file:

 - clear.sh: For the current project, deletes the job status and management database tables (if any), and creates new, empty ones. 
 - <project_name>.jproj: 
 - <project_name>.jsub
 - script.sh
 - setup_jlab.csh
 - status.sh

An overview of the job management database:

An overview of the job status database:

For each project, descend into the new directory, and make changes to each file so that it will work for your project. These changes typically include:
- Changing the project name (e.g. offmon_rp2014m10_v10_type1 --> offmon_rp2014m10_v11_type1) in both the .jproj and .jsub file names, and in the contents of each file.
- If the run period has changed, update it in the contents of each file (e.g. RunPeriod-2014-10 --> RunPeriod-2015-01).
- If the version number has changed, update it in the contents of the .jsub file.
- If the path or file name format for the input files has changed, update it in the .jproj and .jsub files.
- Any other changes to the execution script, environment variables, or job submission instructions can be made in the appropriate files.
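The copy-and-rename step for the project name could be automated along these lines. This is a sketch under stated assumptions, not part of the jproj tools: clone_project is a hypothetical helper, and it assumes every file in the project directory is a small text file. Changes to the run period, version number, or input paths would still need to be made by hand.

```python
import shutil
from pathlib import Path


def clone_project(projects_dir, old_name, new_name):
    """Copy a project directory and replace the project name throughout."""
    src = Path(projects_dir) / old_name
    dst = Path(projects_dir) / new_name
    shutil.copytree(src, dst)  # raises an error if dst already exists
    for path in sorted(dst.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text()  # assumption: all project files are text
        if old_name in text:
            path.write_text(text.replace(old_name, new_name))
        if old_name in path.name:
            # e.g. <old_name>.jproj becomes <new_name>.jproj
            path.rename(path.with_name(path.name.replace(old_name, new_name)))
    return dst


# Example call (not executed here):
# clone_project("projects", "offmon_rp2014m10_v10_type1", "offmon_rp2014m10_v11_type1")
```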

Delete (if any) and create the database table(s) for the current set of job submissions:

./clear.sh

Search for input files matching the string in the .jproj file, and create a row for each in the job management database. You can test by adding an optional argument at the end, which only selects files with a specific file number:

jproj.pl offmon_rp2014m10_v10_type1 update <optional_file_number>

Confirm that the job management database is accurate by printing its contents to screen:

mysql -hhalldweb1 -ufarmer farming -e "select * from offmon_rp2014m10_v10_type1"

ONLY if a mistake was made, to delete the tables from the database and recreate new, empty ones, run:

./clear.sh

Submit the unsubmitted jobs in the job management database, and add their job ids to the job status database:

jproj.pl offmon_rp2014m10_v10_type1 submit

To look at the status of the submitted jobs, first query auger and update the job status database:

fill_in_job_details.pl offmon_rp2014m10_v10_type1

The job status can then be viewed by submitting a query to the job status database:

mysql -hhalldweb1 -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from offmon_rp2014m10_v10_type1Job"

These last two commands can be executed simultaneously by running:

./status.sh
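For a quick tally of the job states, the output of the status query can be summarized. The sketch below assumes the query output has been captured as tab-separated text with a header row, which is what mysql produces in batch mode (the -B option, or when its output is piped). The function name and the status values in the example are made up for illustration.

```python
from collections import Counter


def count_statuses(batch_output: str) -> Counter:
    """Count rows per value of the 'status' column in tab-separated text."""
    lines = [ln for ln in batch_output.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    header = lines[0].split("\t")
    col = header.index("status")  # raises ValueError if the column is missing
    return Counter(ln.split("\t")[col] for ln in lines[1:])


# Made-up sample rows; real status strings come from the batch system.
sample = "id\trun\tstatus\n1\t1506\tDONE\n2\t1507\tACTIVE\n3\t1508\tDONE\n"
print(count_statuses(sample))
```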

Handy mysql instructions:

mysql -hhalldweb1 -ufarmer farming # Enter the "farming" mysql database on "halldweb1" as user "farmer"
quit; # Exit mysql
show tables; # Show a list of the tables in the current database
show columns from offmon_rp2014m10_v10_type1; # show all of the columns for the given table
select * from offmon_rp2014m10_v10_type1; # show the contents of all rows from the given table

Running Over Data As It Comes In

A special user gxproj1 will have a cron job set up to run the plugins as new data appears on /mss. During the week, gxproj1 will submit offline plugin jobs with the same setup as the weekly jobs run the previous Friday. The procedure for this is shown below.

Setting up the environment

The file

/home/gxproj1/setup_jlab.csh

is sourced through .tcshrc. This file is the same as what is linked to by

/home/gluex/setup_jlab_commissioning.csh,

except HALLD_HOME, HDDS_HOME, and JANA_CALIB_URL are set separately so that this user can have a separate build.

To obtain the builds from the previous Friday's runs, execute

/home/gxproj1/halld/monitoring/newruns/setup_previous.sh [year] [month] [day]

The build revisions from the previous Friday are archived in files

/work/halld/data_monitoring/run_conditions/soft_comm_[year]_[month]_[day].xml

and the script will build libraries based on those stored revision numbers.
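The date arguments and the archive file name can be worked out as follows. This is a minimal sketch: the helper names are hypothetical, "previous Friday" is taken to mean the most recent Friday strictly before the given day, and the zero-padded month and day are assumed from file names such as soft_comm_2014_11_06.xml.

```python
from datetime import date, timedelta


def previous_friday(today: date) -> date:
    """Most recent Friday strictly before 'today'."""
    days_back = (today.weekday() - 4) % 7  # Monday is 0, Friday is 4
    if days_back == 0:
        days_back = 7  # on a Friday, go back a full week
    return today - timedelta(days=days_back)


def soft_comm_path(day: date) -> str:
    """Path of the archived build revisions for the given day."""
    return ("/work/halld/data_monitoring/run_conditions/"
            f"soft_comm_{day.year:04d}_{day.month:02d}_{day.day:02d}.xml")


friday = previous_friday(date(2015, 2, 7))
print(friday.year, friday.month, friday.day)  # arguments for setup_previous.sh
print(soft_comm_path(friday))
```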

Running the cron job

To run the cron job go to

/u/home/gxproj1/halld/monitoring/newruns

and do

crontab cron_plugins

To check whether the cron job is running, do

crontab -l

To remove the cron job do

crontab -r

The cron job runs the script scan_for_jobs.sh, which calls generatejobs_plugins_rawdata.sh for any new run that it has not seen before. All previously seen runs are recorded in the file filelists/files_current.txt, so clear this file to run over them again, or set the parameters MINRUN and MAXRUN to restrict the range of runs submitted.
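The selection logic described here can be sketched as below. How scan_for_jobs.sh actually stores and compares its entries is not shown on this page, so the function name, the use of plain run numbers, and the inclusive treatment of the run limits are all assumptions.

```python
def select_new_runs(found_runs, seen_runs, minrun=None, maxrun=None):
    """Return runs not seen before, optionally limited to [minrun, maxrun]."""
    selected = []
    for run in sorted(set(found_runs)):
        if run in seen_runs:
            continue  # already recorded, so do not submit again
        if minrun is not None and run < minrun:
            continue
        if maxrun is not None and run > maxrun:
            continue
        selected.append(run)
    return selected


# An empty 'seen' set plays the role of a cleared record file.
print(select_new_runs([1500, 1501, 1502], seen_runs=set()))
print(select_new_runs([1500, 1501, 1502], seen_runs={1500}, maxrun=1501))
```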

Running Over Archived Data

Once the files are written to tape we can run the online plugins on these files to confirm what we were seeing in the online monitoring. Manual scripts and cron jobs are set up to look for new run numbers and run the plugins over a sample of files.

Details of Offline Monitoring

Below are the procedures to

run a single offline plugin job manually
launch weekly runs
run a cron job to automate the process for new files

In principle these scripts should work as they are, but if the directory structure for the rawdata files changes, or if the memory or disk space needed by the jobs increases significantly, the scripts should be modified.

Generating an offline plugin job

The user gxproj1 should be used for official offline monitoring jobs. Within /home/gxproj1/halld/monitoring/batch/ there will be scripts to run the online monitoring plugins over tape files. The main script is generatejobs_plugins_rawdata.sh, which can be used as

generatejobs_plugins_rawdata.sh [minrun] [maxrun] (minfile) (maxfile)

where minrun, maxrun specify the range of the run #, and minfile, maxfile (optional) specify the file range for that run.

This will generate a script run_rawdata_XXXXXX.sh, where the run # has now been formatted to be 6 digits. Executing this script will send the monitoring plugins job to the Auger batch system.
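To make the argument convention concrete, here is a sketch of the same interface in Python. It is not the shell script itself: parse_ranges and script_name are hypothetical names, and the only details taken from the text are the two required run arguments, the two optional file arguments, and the six-digit run number in the generated script name.

```python
import argparse


def parse_ranges(argv):
    """Parse [minrun] [maxrun] (minfile) (maxfile) as described above."""
    parser = argparse.ArgumentParser(prog="generatejobs_sketch")
    parser.add_argument("minrun", type=int)
    parser.add_argument("maxrun", type=int)
    parser.add_argument("minfile", type=int, nargs="?")  # optional
    parser.add_argument("maxfile", type=int, nargs="?")  # optional
    args = parser.parse_args(argv)
    if args.minrun > args.maxrun:
        parser.error("minrun must not exceed maxrun")
    return args


def script_name(run: int) -> str:
    """Name of the generated submission script for one run."""
    return f"run_rawdata_{run:06d}.sh"


args = parse_ranges(["1506", "1506", "0", "4"])
print(args.minrun, args.maxrun, args.minfile, args.maxfile)
print(script_name(args.minrun))
```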

There is also a script clean.sh which can be used as

clean.sh XXX

This will remove all files that were created for run XXX.

Internally, the xml file used to submit the job will be created, and the job to run will be given within script.sh. All run parameters should be specified at the beginning of generatejobs_plugins_rawdata.sh. Since we are running on tape, the tape file will first be copied over to the cache disk, and the job will run over this cached file.

Launch weekly runs

Every Friday, jobs will be started to run the newest software on all previous runs. This is done using the gxproj1 account. See details below in Procedures of Running Offline Monitoring.

Using cron to run automatically

Within /home/gluex/halld/monitoring/cron/ there is a file cron_plugins that can be executed via

crontab cron_plugins

This will set up a cron job to call the script scan_for_jobs.sh, which will check in the rawdata directory and call generatejobs_plugins_rawdata.sh for any run that is more than 5 min old. The cron job is set up to run every 10 min.
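The "more than 5 min old" condition can be expressed as a simple age test. The sketch below assumes the age is judged from the modification time of the run's entry in the rawdata directory; whether scan_for_jobs.sh uses that timestamp is not stated here, and is_older_than is a hypothetical name.

```python
import os
import time


def is_older_than(path, min_age_seconds=300, now=None):
    """True if 'path' was last modified more than min_age_seconds ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > min_age_seconds


# With a cron interval of 10 min, an entry that fails this check in one pass
# would be picked up by a later pass once it is old enough.
```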

Magnetic Field Settings

For tracking, it is necessary to set the correct magnetic field settings. The below table is now obsolete. Please refer to the GlueX Run Conditions webpage.

The actual field settings for each run have not been documented well. Below are values based on going through entries in the halld log.

Run #s	Solenoid Current (A)	JANA option	notes
940 - 996	1000	-PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104	At run 997, the solenoid started ramping down
998 - 1448	0	-PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine	Solenoid started ramping up around run 1431. Runs 1432 - 1448 should not be used for tracking. See this and this entry.
1449 - 1620	1200	-PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520

Note: The following run ranges were taken with a 300A field and could be analyzed with the magnetic field map Magnets/Solenoid/solenoid_300A_poisson_20140819: 1036 - 1053, 1065 - 1121, 1212? - 1254, 1309 - 1318 (runs 1307, 1308, and 1319 - 1329 were taken while the magnet was ramping). More study is needed to see whether the straight line track fitter is needed for these runs.
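Since the field settings currently have to be chosen by hand, a lookup like the following could reduce mistakes. It is only a sketch that transcribes the three rows of the table above, which the page itself describes as obsolete. The names FIELD_SETTINGS and jana_field_options are hypothetical, and the 300A ranges and ramping runs from the note are deliberately not handled.

```python
# (first run, last run, solenoid current in A, JANA options), from the table above
FIELD_SETTINGS = [
    (940, 996, 1000,
     "-PBFIELD_MAP=Magnets/Solenoid/solenoid_1000A_poisson_20141104"),
    (998, 1448, 0,
     "-PBFIELD_TYPE=NoField -PDEFTAG:DTrackCandidate=StraightLine"),
    (1449, 1620, 1200,
     "-PBFIELD_MAP=Magnets/Solenoid/solenoid_1200A_poisson_20140520"),
]


def jana_field_options(run):
    """Return the JANA options for a run, or None if the run is not listed."""
    for first, last, _current, options in FIELD_SETTINGS:
        if first <= run <= last:
            return options
    return None  # e.g. run 997, when the solenoid started ramping down


print(jana_field_options(1500))
```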

Procedures of Running Offline Monitoring

This section is mostly just for documentation and is intended for the person who will run the jobs periodically (currently Kei). Currently we are thinking of running the jobs every Friday at the end of the day. This allows everybody to see improvements in each detector over the week.

Acquire the build lock for user gluex. This means sending an email to Paul, Sean, Mark I, and Kei asking them not to make changes in the svn directories for user gluex, or to try to build anything, until the jobs are finished.
As of December 12 2014, we run the jobs from user gxproj1 on an independent build. This is the account that processes runs as they come in. To stop this processing, first remove the cron job with crontab -r. Also, delete all jobs that have not started yet. If these jobs are still alive, they will cause confusion over which software version they ran.
svn update & rebuild HDDS
svn update & rebuild sim-recon
svn update & rebuild monitoring hists
~~Prepare the latest sqlite file & update the JANA_CALIB_URL environment variable with~~

cp /group/halld/www/halldweb1/html/dist/ccdb.sqlite /group/halld/www/halldweb1/html/dist/ccdb_2014-MM-DD.sqlite

Set the sqlite file's time context within /home/gxproj1/setup_jlab.csh which is automatically read in from /home/gxproj1/.tcshrc

setenv JANA_CALIB_CONTEXT "calibtime=YYYY-MM-DDT00:00"

Submit jobs. Currently we need to set the magnetic field settings by hand within the script generatejobs_plugins_rawdata.sh.
~~When jobs finish, unlock. This means sending another email to Paul, Sean, Mark I, Kei.~~
Restart cron jobs for immediate processing of runs coming in. Make sure to update magnetic field settings here also.
Contact Sean to notify that there are new runs available in /volatile for copying over to the /work disk.
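The calibration time context used in the steps above follows a fixed pattern, which can be generated from the date of the pass. A minimal sketch, with hypothetical helper names; the only format taken from the text is calibtime=YYYY-MM-DDT00:00.

```python
from datetime import date


def calib_context(day: date) -> str:
    """Value for JANA_CALIB_CONTEXT pinned to midnight of the given day."""
    return f"calibtime={day.isoformat()}T00:00"


def setenv_line(day: date) -> str:
    """The csh line to place in setup_jlab.csh."""
    return f'setenv JANA_CALIB_CONTEXT "{calib_context(day)}"'


print(setenv_line(date(2015, 1, 9)))
```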

Extracting Summary Data

For high-level monitoring, we save images of selected histograms and store time series of selected quantities in a database, which are then displayed on a web page. This section describes how to generate the monitoring images and database information.

The scripts used to generate this summary data are currently kept in /u/home/gluex/halld/monitoring/process. Note that these scripts currently have some parameters which must be periodically set by hand.

The default python version on most JLab machines does not have the modules that allow these scripts to connect to the MySQL database. To run these scripts, load the environment with the following command:

source /u/home/gluex/halld/monitoring/process/monitoring_env.sh

Online Monitoring

There are two scripts for running over the monitoring data generated by the online system and offline reconstruction. The online script can be run with either of the following commands:

./check_new_runs.py
 
OR 
 
./check_new_runs.csh

The shell script sets up the environment properly to run the python script. To connect to the monitoring database on the JLab CUE, modules contained in the installation of python >= 2.7 are needed. The shell script is appropriate to use in a cron job.

The online monitoring system copies a ROOT file containing the results of the online monitoring, along with other configuration files, into a directory accessible outside the counting house. This python script automatically checks for new ROOT files and processes them. It contains several configuration variables that must be set correctly, such as the locations of the input and output directories.

Note that while this script is currently run as a cronjob, the processing of online ROOT files is currently disabled, so its only function is to update the run_info database.

Offline Monitoring

After the data is run over, the results should be processed, so that summary data is entered into the monitoring database and plots are made for the monitoring webpages. Currently, this processing is controlled by a cronjob that runs the following script:

/home/gluex/halld/monitoring/process/check_monitoring_data.csh

This script checks for new ROOT files, and only runs over those it hasn't processed yet. Since one monitoring ROOT file is produced for each EVIO file, whenever a new file is produced, the plots for the corresponding run are recreated and all the ROOT files for a run are combined into one file. Information is stored in the database on a per-file basis.

Plots for the monitoring web page can be made from single histograms or multiple histograms using RootSpy macros. If you want to change the list of plots made, you must modify one of the following files:

histograms_to_monitor - specify either the name of the histogram or its full ROOT path
macros_to_monitor - specify the full path to the RootSpy macro .C file

When a new monitoring run is started, or the conditions are changed, the following steps should be taken to process the new files:

Add a new data version, as described below:
Change the following parameters in check_monitoring_data.csh:
1. JOBDATE should correspond to the output date used by the job submission script
2. OUTPUTDIR should be the directory for the run period and revision of the new version you just submitted. Presumably, this directory will be empty at the beginning.
3. Once you create a new data version as defined below, you should pass the needed information as a command line option. Currently this is done by the ARGS variable. For example, the argument "-v RunPeriod-2014-10,8" tells the monitoring scripts to look up the version corresponding to revision 8 of RunPeriod-2014-10 in the monitoring DB and to use it to store the results.

Example configuration parameters:
set JOBDATE=2015-01-09
set INPUTDIR=/volatile/halld/RunPeriod-2014-10/offline_monitoring
set OUTPUTDIR=/w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver08
set ARGS=" -v RunPeriod-2014-10,8 "
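The relationship between the run period, the revision, OUTPUTDIR, and ARGS in the example above can be sketched as follows. The helper names are hypothetical, and the two-digit verNN directory name is inferred from the examples on this page (ver08, ver02) rather than from the scripts.

```python
def version_arg(run_period: str, revision: int) -> str:
    """Command line option that selects a data version."""
    return f"-v {run_period},{revision}"


def output_dir(run_period: str, revision: int) -> str:
    """Output directory for one revision of one run period."""
    return f"/w/halld-scifs1a/data_monitoring/{run_period}/ver{revision:02d}"


def parse_version_value(value: str):
    """Split 'RunPeriod-2014-10,8' into ('RunPeriod-2014-10', 8)."""
    run_period, revision = value.rsplit(",", 1)
    return run_period, int(revision)


print(version_arg("RunPeriod-2014-10", 8))
print(output_dir("RunPeriod-2014-10", 8))
```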

If you want to process the results manually, the data is processed using the following script:

./process_new_offline_data.py <input directory> <output directory>
 
EXAMPLE:
 
./process_new_offline_data.py 2014-11-14 /volatile/halld/RunPeriod-2014-10/offline_monitoring/ /w/halld-scifs1a/data_monitoring/RunPeriod-2014-10/ver02

The python script takes several options to enable/disable various steps in the processing. Of interest is the "--force" option, which will run over all monitoring ROOT files, whether or not they've been previously identified.

Every time a new reconstruction pass is performed, a new version number must be generated. To do this, prepare a version file as described below. Then run the register_new_version.py script to store the information in the database. The script will return a version number, which then should be set by hand in process_new_offline_data.py - future versions of the script will streamline this part of the procedure. An example of how to generate a new version is:

./register_new_version.py add /u/home/gluex/halld/monitoring/process/versions/vers_RunPeriod-2014-10_pass1.txt

Run Conditions

Currently the run_info database is being updated by Sean by hand. Note that this must be done inside the counting house. If you want to do this yourself, check out the monitoring scripts on a gluon machine

svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/process/

In the process/get_conds directory, run the process_runlog_files.py script with the minimum and maximum run numbers that you want to process, e.g.

./process_runlog_files.py -b 2200 -e 2260

Data Versions

To document the conditions under which the monitoring data is created, we save several pieces of information for the sake of reproducibility and further analysis. The format is intended to be comprehensive enough to document not just monitoring data, but also versions of raw and reconstructed data, so that this database table can be used for the event database as well.

We store one record per pass through one run period, with the following structure:

Field	Description
data_type	The level of data we are processing. For the purposes of monitoring, "rawdata" is the online monitoring, "recon" is the offline monitoring
run_period	The run period of the data
revision	An integer specifying which pass through the run period this data corresponds to
software_version	The name of the XML file that specifies the different software versions used
jana_config	The name of the text file that specifies which JANA options were passed to the reconstruction program
ccdb_context	The value of JANA_CALIB_CONTEXT, which specifies the version of calibration constants that were used
production_time	The date at which monitoring/reconstruction began
dataVersionString	A convenient string for identifying this version of the data

An example file used as input to ./register_new_version.py is:

data_type           = recon
run_period          = RunPeriod-2014-10
revision            = 1
software_version    = soft_comm_2014_11_06.xml
jana_config         = jana_rawdata_comm_2014_11_06.conf
ccdb_context        = calibtime=2014-11-10
production_time     = 2014-11-10
dataVersionString   = recon_RunPeriod-2014-10_20141110_ver01
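A version file in this format is straightforward to read. The sketch below is not the parser used by register_new_version.py; it is a hypothetical reader that assumes one "key = value" pair per line and treats lines starting with "#" as comments. Splitting on the first "=" only keeps values such as calibtime=2014-11-10 intact.

```python
def parse_version_file(text: str) -> dict:
    """Parse 'key = value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError(f"expected 'key = value', got: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


example = """
data_type           = recon
run_period          = RunPeriod-2014-10
revision            = 1
ccdb_context        = calibtime=2014-11-10
"""
print(parse_version_file(example))
```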
