Online Data Challenge 2013 Log

Here is a record of the Hall-D Online Data Challenge Exercise done in the Hall-D Counting House on August 26-29, 2013.

August 26, 2013

  • Today was spent finalizing the setup, including a number of scripts to start and stop processes on multiple systems (a minimal sketch of such a script follows this list)
  • We also finished getting Sean registered and issued a badge so that he has access to the counting house unescorted
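
As an illustration only, here is a minimal sketch of what a start/stop script for processes on multiple systems might look like. The host names, process name, and commands are placeholders, not the actual counting house configuration.

  #!/usr/bin/env python
  # Hypothetical sketch: start or stop a process on several hosts over ssh.
  # The host names and commands are placeholders for illustration only.
  import subprocess
  import sys

  HOSTS = ["gluon20", "gluon21", "gluon22"]   # example nodes only
  COMMANDS = {
      "start": "nohup my_daq_process > /dev/null 2>&1 &",  # placeholder command
      "stop":  "pkill -f my_daq_process",                   # placeholder command
  }

  def run_on_all(action):
      # Issue the same command on every host; ssh returns when the remote shell exits.
      for host in HOSTS:
          subprocess.call(["ssh", host, COMMANDS[action]])

  if __name__ == "__main__":
      if len(sys.argv) != 2 or sys.argv[1] not in COMMANDS:
          sys.exit("usage: farm_ctl.py start|stop")
      run_on_all(sys.argv[1])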

August 27, 2013

Executing run plan:

1. Start/stop scripts verified and are now in routine use. We noticed that the ET systems can often leave the system in a state that makes it hard to restart properly. The speculation is that the ET system file is not actually being deleted, so a restart tries to reuse it (unsuccessfully). The start/stop scripts were updated to remove the ET system file just before starting the system (see the sketch below).
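
A minimal sketch of that cleanup step follows, assuming the ET system file lives at a fixed, known path; the path and the start command are placeholders, not the actual configuration.

  #!/usr/bin/env python
  # Hypothetical sketch of the cleanup added to the start script: remove any
  # stale ET system file before starting the ET system, so a leftover file
  # from a previous run cannot block the restart.
  import os
  import subprocess

  ET_FILE = "/tmp/et_sys_dc2013"              # placeholder ET system file path
  ET_START_CMD = ["et_start", "-f", ET_FILE]  # placeholder start command

  def start_et():
      if os.path.exists(ET_FILE):
          os.remove(ET_FILE)                  # delete the stale ET system file
      subprocess.call(ET_START_CMD)           # then start the ET system

  if __name__ == "__main__":
      start_et()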

2. The 100 Hz run went fine. CPU on all nodes was maxed out because JANA spins when starved for events rather than sleeping.


August 28, 2013

3. Added a time delay to JANA's NextEvent so the reported CPU usage will be more accurate (a sketch of the idea follows the observations below).

  • All L3 nodes and monitoring nodes are run with NTHREADS=Ncores
  • Event rate set at evio2et using a 1 ms delay, resulting in 920-960 Hz
  • gluon42: running L3 only, has between 85% and 95% CPU usage (90% idle)
  • gluon20: running preL3 monitoring only has 250% CPU usage (70% idle)
    • ~5.5 MB/s going into each of gluon20, gluon21, and gluon22, which are the only 3 preL3 monitoring nodes. The load appears to be balanced between the nodes
  • gluon23: running postL3 monitoring only has 300%-310% CPU usage (63% idle). This should be the same as preL3 since both are running the same plugins and generating the same histograms. This may be because the other postL3 nodes are also running L3, which pushes more of the postL3 monitoring load onto gluon23, which is *not* running L3.
  • We noticed the rate into the preL3 monitor ET system was slightly smaller than the rate into the postL3 monitor ET system (~800 Hz vs. 900 Hz).
    • Taking gluon44 and gluon45 out of the L3 farm cured the discrepancy.
    • We tested whether just gluon44 or just gluon45 was responsible; the effect was due to both
  • Killing L3 processes on other nodes caused the percent idle on one of the remaining nodes to decrease slightly.
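
As an illustrative sketch only (not the actual JANA code, which is C++), the NextEvent change mentioned in item 3 amounts to sleeping briefly in the event-fetching loop when no event is available instead of spinning; the function and parameter names below are assumptions.

  # Hypothetical sketch of sleeping instead of spinning when starved for events.
  # get_next_event() and its return convention are assumptions for illustration.
  import time

  def next_event(source, poll_delay=0.001):
      """Return the next event, sleeping ~1 ms between polls while starved."""
      while True:
          event = source.get_next_event()  # assumed: returns None when starved
          if event is not None:
              return event
          # Sleeping here keeps a starved thread from busy-waiting at 100% CPU,
          # so the reported CPU usage reflects real work.
          time.sleep(poll_delay)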

August 29, 2013