Online Monitoring Shift

  
 
[[Image:20141021_CommissioningMonitoringArchitecture.png| thumb | 400px | Fig. 1. Online Monitoring for Fall 2014 Commissioning run.]]
 
 
 
  
 
[[Image:20141021_hdmongui_py.png| thumb | 400px | Fig. 2. ''hdmongui.py'' screen. Start this from the hdops account by simply typing ''hdmongui.py'' in a terminal.]]
  
The Online Monitoring System is a software system that couples with the Data Acquisition System to monitor the quality of the data as it is read in. The system is responsible for ensuring that the detector systems are producing data of sufficient quality that a successful offline analysis, capable of producing a physics result, is likely. The system itself does not contain alarms or checks on the data. Rather, it supplies histograms and relies on shift takers to periodically inspect them to ensure all detectors are functioning properly.
  
Events will be transported across the network via the ET (Event Transfer) system developed and used as part of the DAQ architecture. The configuration of the processes and nodes is shown in Fig. 1.
  
 
== Routine Operation ==
 
 
=== Starting and stopping the Monitoring System ===
 
  
The monitoring system should be automatically started and stopped by the DAQ system whenever a new run is started or ended (see [[Data Acquisition Shift | Data Acquisition]] for details). Shift workers may start or stop the monitoring system by hand if needed. This should be done from the ''hdops'' account by running either the ''start_monitoring'' or ''stop_monitoring'' script. One can also do it via buttons on the ''hdmongui.py'' program (see Fig. 2).
  
These scripts may be run from any gluon computer since they will automatically launch multiple programs on the appropriate computer nodes. If processes are already running on the nodes, new ones are not started, so it is safe to run ''start_monitoring'' multiple times. To check the status of the monitoring system, run the ''hdmongui.py'' program as shown in Fig. 2. A summary is given in the following table:
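For reference, a minimal by-hand sequence from a terminal might look like the following sketch (the gluon host name is only an example; any gluon computer works):

<pre>
# Log into a gluon machine as hdops (host name is illustrative)
ssh hdops@gluon04

# (Re)start all monitoring processes; it is safe to run this more than once
start_monitoring

# Check the status of the monitoring system with the GUI
hdmongui.py &

# Stop all monitoring processes when they are no longer needed
stop_monitoring
</pre>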
  
 
{|class="wikitable" | width=600px
 
{|class="wikitable" | width=600px
Line 26: Line 22:
 
! Action
 
! Action
 
|-
 
|-
| '''start_monitoring''' || Starts all programs required for the the online monitoring system. WARNING: This will kill any existing monitoring processes before restarting them.
+
| '''start_monitoring''' || Starts all programs required for the the online monitoring system.  
 
|-
 
|-
 
| '''stop_monitoring''' || Stops all monitoring processes
 
| '''stop_monitoring''' || Stops all monitoring processes
Line 38: Line 34:
 
=== Viewing Monitoring Histograms ===
 
  
Live histograms may be viewed using the ''RootSpy'' program. Start it from the ''hdops'' account on any gluon node. It will communicate with all histogram producer programs on the network and start cycling through a subset of them for shift workers to monitor. Users can turn off the automatic cycling and select different histograms to display using the GUI itself.  
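For example, from a terminal on a gluon machine logged in as ''hdops'', bringing up the live display might look like this:

<pre>
# Optionally confirm that the histogram producers are running
hdmongui.py &

# Launch the live histogram viewer; it will start cycling through
# a subset of histograms automatically
RootSpy &
</pre>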
  
'''Resetting Histograms''': The RootSpy GUI has a pair of buttons labeled ''Reset'' and ''Un-reset''. The first will reset the local copies of all histograms displayed in all pads of the current canvas. This does '''not''' affect the histograms in the monitoring processes and therefore has no effect on the archive ROOT file. What it actually does is save an in-memory copy of the existing histogram(s) and subtract it from what it receives from the producers before displaying them as the run progresses. This feature allows one to periodically reset any display without stopping the program or disrupting the archive. "Un-reset"-ing simply deletes the copies, allowing one to return to viewing the full statistics.
  
 
{|class="wikitable" | width=600px
 
{|class="wikitable" | width=600px
Line 51: Line 47:
 
|}
 
|}
  
 
==Quick Primer for Experts==
 
 
This section describes some basics of the system to give new and old experts a brief overview/reminder of the important files and scripts in the system.
 
 
* When CODA starts a run, it runs the script
 
*: ''$DAQ_HOME/scripts/run_prestart''
 
*: which in turn runs
 
*: ''/gluex/builds/devel/$BMS_OSNAME/bin/start_monitoring''
 
* The ''start_monitoring'' script is run with an argument ''-RXXX'' where ''XXX'' is the run number

** This is actually run first with ''-RXXX -e'' to kill any existing processes and then without the ''-e'' to (re)start everything
 
** The ET system parameters are obtained from the COOL configuration using the ''coolutils'' python module which is in ''/home/hdops/CDAQ/daq_dev_v0.31/daq/tools/pymods''
 
** One may use ''start_monitoring'' by hand with an EVIO file (full path) for testing; a short sketch is given after this list
 
* The configuration of which nodes are used for monitoring etc. is given in ''$DAQ_HOME/config/monitoring/nodes.conf''
 
** This specifies nodes and "levels", which are actually freeform strings corresponding to JANA config files in the same directory (same name, but with a ".conf" suffix).
 
** The nodes.conf file also specifies where a secondary ET system should be run and which levels should use it. This is usually all levels.
 
*** The ''start_monitoring'' script will translate host names to the Infiniband name or IP address so that ET connections are all done using IB.
 
* Status of the monitoring system can be monitored using ''hdmongui.py''
 
** This communicates with all processes using the janactl plugin via the cMsg server on gluondb1 (n.b. RootSpy histograms are communicated via a different cMsg server)
 
* The RootSpy system itself can be monitored using the ''RSMonitor'' program. Note that this works by subscribing to all cMsg messages, so it will tend to double the RootSpy traffic (and the traffic increases accordingly for every additional instance).
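Putting the pieces above together, the sequence that ''run_prestart'' effectively performs for a given run, and a by-hand test against a raw data file, might look like the sketch below. The run number and file path are illustrative, and the exact way the EVIO file is passed should be checked against the script's usage before relying on it.

<pre>
# What run_prestart effectively does at the start of run 12345 (run number illustrative):
# first kill any existing monitoring processes, then (re)start everything
/gluex/builds/devel/$BMS_OSNAME/bin/start_monitoring -R12345 -e
/gluex/builds/devel/$BMS_OSNAME/bin/start_monitoring -R12345

# By-hand test using an EVIO file instead of the live ET system
# (full path required; the path below is only a placeholder)
start_monitoring /path/to/hd_rawdata_012345_000.evio
</pre>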
 
 
 
== Advanced Details of the Monitoring System ==
 
 
The online monitoring consists primarily of generating numerous histograms that can be viewed by shift takers or analyzed automatically by macros to check the data quality. The system is therefore made up of histogram producers and consumers.
 
 
=== Producers ===
 
The histograms are produced by a set of plugins, each representing a different detector or online system. The plugins are attached to processes running on multiple computers in the counting house. The nodes used will vary depending on whether the DAQ is configured to run an L3 trigger and how many nodes are required by the algorithm being run. The node names will be in the pool specified as "L3" in the list maintained on the [https://halldweb1.jlab.org/wiki/index.php/HallD_Online_IP_Name_And_Address_Conventions HallD Online IP Name And Address Conventions] page of the GlueX wiki.
 
The monitoring processes will be started and killed automatically by the DAQ system via scripts attached to state transitions.
 
 
The definitions of the histograms are ultimately the responsibility of the detector or online system experts.
 
 
===Consumers ===
 
The primary consumer of the histograms will be the [http://www.jlab.org/RootSpy RootSpy] system. This has both a GUI interface for shift takers to monitor and an archiver that can be used to store histograms in files for later viewing. To start the viewer, simply type "RootSpy" from the command line in the [https://halldweb1.jlab.org/wiki/index.php/Policies_for_Using_Online_Directories_and_Accounts hdops account]. The ''RSArchiver'' program is a command-line tool used to gather histograms from the RootSpy producers and archive them in a ROOT file. This file will be copied automatically by a DAQ system script to the RAID disk alongside the raw data so that it is stored on tape with the data.
 
 
 
== Accessing Onsite Webpages From Offsite ==
 
 
Some webpages are not accessible from outside the JLab network. To get to these from offsite, you'll need to set up an ssh tunnel using your CUE account. [[Accessing Onsite Webpages From Offsite]] gives an example of how to run a web browser from a VNC session on a machine at JLab so you can access internal web pages.
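As a rough illustration, one common approach is a SOCKS proxy through a JLab login host using your CUE account; the host name and port below are assumptions, and the linked page above describes the supported procedure in detail.

<pre>
# Open a SOCKS proxy through the JLab gateway (host name and port are illustrative)
ssh -D 8080 your_cue_username@login.jlab.org

# Then configure your web browser to use the SOCKS proxy at localhost:8080
# and browse to the internal Hall D pages as usual.
</pre>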
 
  
 
== Expert personnel ==
 
Expert details on the Online Monitoring system can be found [[Online Monitoring Expert|here]].
 
The individuals responsible for the Online Monitoring are shown in the following table.
 
Problems with normal operation of the Online Monitoring should be referred to those individuals, and any changes to their settings must be approved by them. Additional experts may be trained by the system owner and their name and date added to this table.


{|class="wikitable" | width=600px
|+ Table: Expert personnel for the Online Monitoring system
! Name
! Extension
! Date of qualification
|-
| David Lawrence || 269-5567 || May 28, 2014
|}