Online Data Challenge 2013 Log

From GlueXWiki

Here is a record of the Hall-D Online Data Challenge exercise held in the Hall-D Counting House, August 26-29, 2013.

August 26, 2013

  • Today was spent finalizing some setup including a number of scripts to start and stop processes on multiple systems
  • We also finished registering Sean and getting him issued a badge so that he has unescorted access to the counting house

August 27, 2013

Executing run plan:

1. Start/stop scripts verified and are now in routine use. We noticed that ET can often leave the system in a state that makes it hard to restart properly. The speculation is that the ET system file is not actually being deleted, and a restart then tries to use the stale file (unsuccessfully). The start/stop scripts were updated to remove the ET system file just before starting the system.
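The cleanup step could be sketched roughly as follows (the helper name and the /tmp/et_sys_<name> path convention are illustrative assumptions, not the actual script):

```python
import os

def remove_stale_et_file(et_file):
    """Delete a leftover ET system file, if present, before restarting.

    If the system file survives a crash, a restart may try to attach to
    the dead system instead of creating a fresh one. The path convention
    (/tmp/et_sys_<name>) is an assumption about the local ET configuration.
    """
    if os.path.exists(et_file):
        os.remove(et_file)
        return True   # a stale file was cleaned up
    return False      # nothing to remove

# Example: clear any stale file before launching the ET system process.
# remove_stale_et_file("/tmp/et_sys_halld")
```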

2. Running at 100 Hz went fine. CPU usage on all nodes was maxed out because JANA spins when starved for events rather than sleeping.
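The spinning described here is a plain busy-wait. A minimal sketch of the alternative (not the actual JANA code; it assumes an event source exposing a non-blocking try_get() method):

```python
import time

def get_next_event(source, poll_delay=0.001):
    """Poll an event source, sleeping briefly when starved.

    A busy loop with no sleep pins a core at 100% even at low event
    rates; a short sleep trades a little latency for honest CPU
    accounting. `source` is any object with a try_get() method that
    returns an event or None (an assumption made for this sketch).
    """
    while True:
        event = source.try_get()
        if event is not None:
            return event
        time.sleep(poll_delay)  # yield the CPU instead of spinning
```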

August 28, 2013

3. Added a time delay to JANA's NextEvent so that the reported CPU usage will be more accurate.

  • All L3 nodes and monitoring nodes are run with NTHREADS=Ncores
  • Event rate set at evio2et with a 1 ms inter-event delay, resulting in 920-960 Hz
  • gluon42: running L3 only has between 85%-95% CPU usage (90% idle)
  • gluon20: running preL3 monitoring only has 250% CPU usage (70% idle)
    • ~5.5 MB/s is going into each of gluon20, gluon21, and gluon22, which are the only 3 preL3 monitoring nodes. The load appears to be balanced among the nodes
  • gluon23: running postL3 monitoring only has 300%-310% CPU usage (63% idle). This should be the same as preL3, since both are running the same plugins and generating the same histograms. The difference may be because the other postL3 nodes are also running L3, which pushes more of the postL3 monitoring load onto gluon23, which is *not* running L3.
  • We noticed the rate into the preL3 monitor ET system was slightly smaller than into the postL3 monitor ET system (~800 Hz vs. ~900 Hz).
    • Taking gluon44 and gluon45 out of the L3 farm cured the discrepancy.
    • We tested whether just gluon44 or just gluon45 was responsible; the effect was due to both
  • Killing L3 processes on other nodes caused the percent idle on one of the remaining nodes to decrease slightly.
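The 1 ms delay set at evio2et implies a 1 kHz ceiling, so the observed 920-960 Hz corresponds to a few tens of microseconds of extra per-event overhead. A few lines of arithmetic make this explicit (the overhead figure is inferred, not measured):

```python
def expected_rate_hz(delay_s, overhead_s=0.0):
    """Event rate implied by a fixed per-event delay plus any extra overhead."""
    return 1.0 / (delay_s + overhead_s)

print(expected_rate_hz(0.001))          # 1 ms delay alone -> 1000 Hz ceiling
print(expected_rate_hz(0.001, 80e-6))   # ~80 us of extra overhead -> ~926 Hz
```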

4. Turned off both monitoring et2et processes so we could test L3 throughput.

  • With full parsing + pass-through L3 we got 14.3 kHz.
    • We turned off each L3 process in turn and recorded the rates: first the 8-core 1.9 GHz AMDs, then the 8-core+8-HT 2.53 GHz Intels (there were 8 nodes of the first type and 2 of the second).
      • 14.3kHz, 13.3kHz, 12.3kHz, 11.1kHz, 10.1kHz, 9.1kHz, 8.1kHz, 7.1kHz, 6.1kHz, 3.1kHz, 0kHz
  • Turned off DOM tree creation and event unpacking in the hdl3 processes. The rate went up to ~31 kHz and CPU usage dropped to about one core per node (100% via top).
    • Noticed an asymmetry between the preL3 and postL3 ET systems, but opposite to what was observed in item 3 above. Here the asymmetry was very pronounced (preL3 = 15.7 kHz, postL3 = 9.6 kHz).
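The rate sequence recorded in item 4 can be turned into per-node contributions by differencing successive entries. A small helper (hypothetical, for illustration only) makes the two node classes visible:

```python
def per_node_contributions(rates_khz):
    """Rate lost at each step as L3 processes are switched off one at a time.

    The input is the sequence of total rates recorded after each process
    was stopped, starting from the rate with all nodes running.
    """
    return [round(a - b, 1) for a, b in zip(rates_khz, rates_khz[1:])]

# The rates recorded above (kHz), all nodes down to none:
rates = [14.3, 13.3, 12.3, 11.1, 10.1, 9.1, 8.1, 7.1, 6.1, 3.1, 0.0]
print(per_node_contributions(rates))
# The eight AMD nodes each carried ~1.0-1.2 kHz; the two Intel nodes
# carried ~3 kHz apiece.
```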

August 29, 2013

Post-Data Challenge

Some results were not recorded during the challenge but were reproduced afterwards.

  • L3 rates using MIT's BDT algorithm: in = ~7.2 kHz, out = ~1.5 kHz