Online Monitoring Expert


Overview

This page describes the online monitoring system, giving both new and experienced experts a brief overview of the important files and scripts in the system.

Click the link below to see the checklist for online monitoring:

ONLINE MONITORING START OF RUN CHECKLIST


  • When CODA starts a run, it runs the script
    $DAQ_HOME/scripts/run_prestart
    which in turn runs
    start_monitoring
    This is found via PATH, but is normally located at /gluex/builds/devel/$BMS_OSNAME/bin/start_monitoring
  • The start_monitoring script is run with an argument -RXXX, where XXX is the run number
    • It is actually run first with -RXXX -e to kill any existing processes, and then without the -e to (re)start everything (see the usage sketch after this list)
    • The ET system parameters are obtained from the COOL configuration using the coolutils Python module, which is in /home/hdops/CDAQ/daq_dev_v0.31/daq/tools/pymods
    • One may run start_monitoring by hand with an EVIO file (full path) for testing
  • The configuration of which nodes are used for monitoring etc. is given in $DAQ_HOME/config/monitoring/nodes.conf
    • This specifies nodes and "levels"; the levels are freeform strings corresponding to JANA config files in the same directory (same name, prefixed with "hdmon" and suffixed with ".conf").
    • The nodes.conf file also specifies where a secondary ET system should be run and which levels should use it. This is usually all levels.
      • The start_monitoring script will translate host names to the Infiniband name or IP address so that ET connections are all done using IB.
  • The status of the monitoring system can be checked using hdmongui.py
    • This communicates with all processes using the janactl plugin via the cMsg server on gluondb1 (n.b. RootSpy histograms are communicated via a different cMsg server)
  • The RootSpy system itself can be monitored using the RSMonitor program. Note that this works by subscribing to all cMsg messages, so it will roughly double the RootSpy traffic (and increase it accordingly for every additional instance).
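
As a concrete illustration, the fragment below sketches how start_monitoring might be invoked by hand using the -R and -e flags described above. The run number and EVIO file path are hypothetical placeholders.

    # Kill any existing monitoring processes for the run (run number is a placeholder)
    start_monitoring -R42573 -e

    # (Re)start everything for the same run
    start_monitoring -R42573

    # Test by hand against an EVIO file (full path below is hypothetical)
    start_monitoring /path/to/hd_rawdata_042573_000.evio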

Configuration for plugins on gluon machines

  • The JANA configuration files used for the online monitoring process in the counting house can be found on the gluon machines at
    • High level and timing plugins: $DAQ_HOME/config/monitoring/hdmonHIGHLEVEL.conf
    • Occupancy plugins: $DAQ_HOME/config/monitoring/hdmonOCCUPANCY.conf
  • When logged in as hdops, one can run the same code as the monitoring processes by using these JANA configurations and running hd_root on gluon100 or a similar machine (see the sketch after this list).
  • If the halld_recon libraries change, the hdmon executable must be relinked. This should only be done by experts:
    • Login as hdsys
    • cd /gluex/builds/devel/packages/monitoring/src/hdmon
    • scons -u install ginstall
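
As a sketch, the occupancy plugins can be run by hand as shown below. This assumes the standard JANA --config option for reading a configuration file; the EVIO path is a hypothetical placeholder.

    # On gluon100 (or similar), logged in as hdops, run the occupancy
    # monitoring code over a raw data file (path below is a placeholder)
    hd_root --config=$DAQ_HOME/config/monitoring/hdmonOCCUPANCY.conf \
            /path/to/hd_rawdata_042573_000.evio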

Advanced Details of the Monitoring System

The online monitoring consists primarily of generating numerous histograms that can be viewed by shift takers or analyzed automatically by macros to check the data quality. The system is therefore made up of histogram producers and consumers.

Producers

The histograms are produced by a set of plugins, each representing a different detector or online system. The plugins are attached to processes running on multiple computers in the counting house. The nodes used will vary depending on whether the DAQ is configured to run an L3 trigger and how many nodes are required by the algorithm being run. The node names will be in the pool specified as "L3" in the list maintained on the HallD Online IP Name And Address Conventions page of the GlueX wiki. The monitoring processes are started and killed automatically by the DAQ system via scripts attached to state transitions.

The original implementation used a set of detector-specific plugins provided by the detector groups. In practice, the number of histograms produced overwhelmed the RootSpy system, rendering it unusable. Those plugins are still used in the offline analysis (see the "Data Monitoring" section on the right side of the public GlueX wiki). Now, only two plugins are used, occupancy_online and highlevel_online, which provide a limited set of histograms and macros to summarize detector performance.
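
As a rough sketch, a JANA configuration file such as hdmonOCCUPANCY.conf uses the usual JANA key-value format; the excerpt below is illustrative only and does not reproduce the actual file contents.

    # Hypothetical excerpt in the JANA "key value" format
    PLUGINS occupancy_online
    NTHREADS 4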


Consumers

The primary consumer of the histograms is the RootSpy system. It provides both a GUI for shift takers to monitor the histograms and an archiver that can store them in files for later viewing. To start the viewer, simply type "RootSpy" from the command line in the hdops account. The RSArchiver program is a command-line tool used to gather histograms from the RootSpy producers and archive them in a ROOT file. Details are given in the following section.

RSArchiver

This program is launched from the start_monitoring script if it is given a run number. (If start_monitoring is launched without a "-R" argument, then the archiver is not started.) The node that it runs on is specified in the "nodes.conf" file, and the output directory for the ROOT file is:

/gluex/data/rawdata/curr/rawdata/active/$RUN_PERIOD/rawdata/RunXXXXXX/monitoring/hdmon_onlineXXXXXX.root
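
For example, with a hypothetical run number of 42573 in a run period named RunPeriod-2019-11, this would expand to:

    /gluex/data/rawdata/curr/rawdata/active/RunPeriod-2019-11/rawdata/Run042573/monitoring/hdmon_online042573.root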

The file is written to tape when the hd_stage_to_tape.py script makes a tar file of the monitoring directory and links it into the staging area.

This file is also copied to the work disk, where the offline analysis system can access it (as ver00). The copy is made by the hdonline_rsync.sh script, which runs from an hdsys cron job on gluonraid1 every 20 minutes.
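
A sketch of what the corresponding crontab entry might look like (the script path is a hypothetical placeholder; only the script name and the 20-minute schedule come from the description above):

    # Hypothetical hdsys crontab entry on gluonraid1: run every 20 minutes
    */20 * * * * /path/to/hdonline_rsync.sh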


Expert personnel

The individuals responsible for the Online Monitoring are shown in the following table. Problems with normal operation of the Online Monitoring should be referred to those individuals, and any changes to its settings must be approved by them. Additional experts may be trained by the system owner, with their name and date of qualification added to this table.

Table: Expert personnel for the Online Monitoring system

  Name             Extension   Date of qualification
  David Lawrence   269-5567    May 28, 2014