Latest revision as of 11:10, 31 March 2015

GlueX Data Challenge Meeting
Friday, February 28, 2014
11:00 pm, EST
JLab: CEBAF Center, L207
ESNet: 8542553
ReadyTalk: (866)740-1260, access code: 1833622
ReadyTalk desktop: http://esnet.readytalk.com/ , access code: 1833622

Agenda

  1. Announcements
  2. Status of Preparations
    • Random number seeds procedure? Mark/Anyone
    • Running jobs at CMU Paul
    • Running jobs at NU Sean - slides: https://halldweb.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf
    • Running jobs at MIT Justin
    • Running jobs at JLab
    • Electromagnetic Backgrounds update. Paul/Kei
    • Check on event genealogy. Kei
    • Preparations of standard distribution/scripts Mark/Richard/Sean
  3. Proposed Schedule
    • Launch of Data Challenge Thursday March 6, 2014 (est.).
    • Test jobs going successfully by Tuesday March 4.
    • Distribution ready by Monday March 3.
  4. AOT

Minutes

Present:

  • CMU: Paul Mattione, Curtis Meyer
  • FSU: Volker Crede, Priyashree Roy, Aristeidis Tsaris
  • IU: Kei Moriya
  • JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Simon Taylor
  • MIT: Justin Stevens
  • NU: Sean Dobbs
  • UConn: Richard Jones

Announcements

  • Mark announced an update of the branch (https://mailman.jlab.org/pipermail/halld-offline/2014-February/001511.html). Changes include:
    1. A fix from Simon for single-ended TOF counters.
    2. Improvements from Paul for cutting off processing for multi-lap curling tracks.
    3. A change from David Lawrence to allow compression to be turned off in producing REST format data (https://mailman.jlab.org/pipermail/halld-offline/2014-February/001512.html).
      • David noticed that all three of the programs hdgeant, mcsmear, and DANA produce HDDM-like output, but only DANA has compression turned on (REST data in this case). This feature will allow us to test whether compression has anything to do with the short REST files. On a side note, David reported that the short-REST-file problem was not reproducible for him. Mark produced some example hdgeant_smeared.hddm files that gave short output, for David to test.

Running Jobs at JLab

Mark has submitted some test jobs against the new branch. [Added in press: 1,000 jobs of 50 k events each have been submitted.]

Status of Preparations

Random number seeds procedure

Paul spoke to David about this. It seems that mcsmear is currently generating its own random number seed. We still have details to fill in on this story.
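Since the seed procedure was still open at the time of the meeting, the following is only a sketch of one possible convention, not a description of how mcsmear, hdgeant, or any GlueX script actually chooses seeds. It derives a reproducible seed from the run number, the file number, and a stage label, so that every job and every program in the chain gets a distinct but repeatable value. The function name job_seed, the stage labels, and the run number 9001 are hypothetical.

```python
import hashlib


def job_seed(run, file_number, stage):
    """Return a reproducible seed in [1, 2**31 - 1] for one job stage.

    Hypothetical convention: hash the identifiers that name the job,
    so rerunning the same (run, file, stage) reproduces the same seed.
    """
    key = "{}:{}:{}".format(run, file_number, stage).encode("ascii")
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:8], "big")
    # Fold into the positive 31-bit range and avoid a seed of zero.
    return value % (2**31 - 1) + 1


if __name__ == "__main__":
    # Illustrative run and file numbers only.
    for stage in ("hdgeant", "mcsmear"):
        print(stage, job_seed(9001, 1, stage))
```

A scheme like this would make a failed job rerunnable with identical random numbers, which is one reason to prefer it over each program picking its own seed.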

Running Jobs at FSU

FSU has started running data challenge test jobs on their cluster. Aristeidis has started with 50 jobs, but an early look shows problems with some of them in hd_root. Also there was the GlueX-software-induced crash of the FSU cluster[?].

Running jobs at CMU

Paul is seeing ZFATAL errors from hdgeant. He will send a bug report to Richard who will look into a fix beyond merely increasing ZEBRA memory.

Richard asked about an issue where JANA takes a long time to identify a CODA file as not an HDDM file. Richard would like to fix the HDDM parser such that this is not the case. Mark D. will send Richard an example.

Running Jobs at NU

Sean regaled us with tales of site-specific problems.

Lots of jobs crashed at REST generation. Site configuration changes helped, but there were still a lot of jobs hanging, usually on new nodes. Reducing the number of submit slots fixed most of the problems. Many of the remaining failures were jobs hung on the first event when accessing the magnetic field. Jobs are single-threaded. Some statistics on the results (https://halldweb.jlab.org/wiki/images/f/f8/DC2-Meeting-sdobbs-20140228.pdf) were presented as well.

Richard remarked that on the OSG, jobs will start much faster if declared as single-threaded.

Richard proposed the following standards:

  • BGRATE 1.1 (equivalent to 10^7)
  • BGGATE -800 800 (in ns, time gate for EM background addition)

We agreed on these as standard settings.

Mark proposed the following split of running:

  • 15% with no EM background
  • 70% with EM background corresponding to 10^7
  • 15% with EM background corresponding to 5×10^7

There was general agreement; adjustment may happen in the future.
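As a rough illustration of the proposed split, the sketch below turns the 15/70/15 percentages into whole job counts for a given total, using largest-remainder rounding so the counts always add up to the total. The function name split_jobs and the total of 1,000 jobs are assumptions made for the example; the minutes do not fix the overall number of jobs.

```python
def split_jobs(total_jobs, percent):
    """Split total_jobs among conditions given integer percentages.

    Uses largest-remainder rounding so the counts sum to total_jobs.
    """
    if sum(percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total_jobs * p // 100 for name, p in percent.items()}
    remainders = {name: total_jobs * p % 100 for name, p in percent.items()}
    leftover = total_jobs - sum(counts.values())
    # Hand the jobs lost to rounding to the largest remainders first.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for name in ranked[:leftover]:
        counts[name] += 1
    return counts


# Percentages as proposed at the meeting.
SPLIT = {
    "no EM background": 15,
    "EM background at 10^7": 70,
    "EM background at 5x10^7": 15,
}

if __name__ == "__main__":
    # The total of 1000 jobs is illustrative only.
    for name, n in split_jobs(1000, SPLIT).items():
        print("{:>5d}  {}".format(n, name))
```

With a total of 1,000 jobs this gives 150, 700, and 150 jobs for the three conditions.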

Running Jobs at MIT

Justin has been running with the dc-2.2 tag. The OpenStack cluster at MIT has about 180 cores and he has been running jobs for a couple of days with good success. BGGATE was set at -200 to 200.

Electromagnetic Background

Kei gave us an update (https://halldweb.jlab.org/wiki/images/e/ea/2014-02-28-DC2.pdf) on his studies of EM background with hdds-2.0 and sim-recon-dc-2.1. Slides covered:

  • Memory Usage
  • CPU time
  • mcsmear File Sizes
  • REST File Sizes
  • Another Bad File
  • Sum of parentid=0
  • Correlation of CDC hits
  • Correlation of FDC hits
  • pπ+π- Events

Proposed Schedule

The schedule has slipped. The new schedule is as follows:

  1. Launch of Data Challenge Thursday March 6, 2014 (est.).
  2. Test jobs going successfully by Tuesday March 4.
  3. Distribution ready by Monday March 3.

Justin pointed out that the short REST file problem might be something that we could live with for this data challenge.

Richard asked that Mark assign run numbers and run conditions for the various sites.
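One way such an assignment could be organized is sketched below: for each run condition, every site receives a contiguous, non-overlapping range of file numbers. This is a hypothetical bookkeeping scheme, not the assignment Mark actually made; the function name assign_file_numbers and the per-site job counts are invented for the example.

```python
def assign_file_numbers(job_counts):
    """Assign contiguous file-number ranges per site within each condition.

    job_counts maps condition -> {site: number of files}.
    Returns table[condition][site] = (first_file, last_file).
    File numbering restarts at 1 for every condition.
    """
    table = {}
    for condition, per_site in job_counts.items():
        next_file = 1
        table[condition] = {}
        for site, n_files in per_site.items():
            if n_files <= 0:
                continue  # site does not run this condition
            table[condition][site] = (next_file, next_file + n_files - 1)
            next_file += n_files
    return table


if __name__ == "__main__":
    # Hypothetical counts; the real numbers were not set at this meeting.
    counts = {
        "no EM background": {"JLab": 100, "MIT": 50},
        "EM background at 10^7": {"JLab": 400, "CMU": 100, "NU": 100, "MIT": 100},
    }
    for condition, sites in assign_file_numbers(counts).items():
        for site, (first, last) in sites.items():
            print("{:<24} {:<5} files {}-{}".format(condition, site, first, last))
```

Keeping the ranges disjoint within a condition means output files from different sites can be merged later without renaming.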

Action Items

  1. Understand random number seed system.
  2. Solve ZFATAL crashes.
  3. Make a table of conditions vs. sites where the entries are assigned file numbers.
  4. Report the broken Polycom in L207.