GlueX Data Challenge Meeting, March 7, 2014

GlueX Data Challenge Meeting
Friday, March 7, 2014
11:00 pm, EST
JLab: CEBAF Center, F326
ESNet: 8542553
SeeVogh: http://research.seevogh.com
ReadyTalk: (866)740-1260, access code: 1833622
ReadyTalk desktop: http://esnet.readytalk.com/ , access code: 1833622

Agenda

  1. Announcements
  2. Review of minutes from last time
  3. Random Number Seeds Procedure (as of 2014-03-07)? Mark/Anyone
  4. ZFATAL fix (Richard)
  5. Short file issue (all)
  6. Running jobs at CMU (Paul)
  7. Running jobs at NU (Sean)
  8. Running jobs at MIT (Justin)
  9. Running jobs at JLab (Mark)
    1. Nodes at JLab
    2. SRM update
  10. Running jobs at FSU (Aristeidis)
  11. Electromagnetic Backgrounds update. Paul/Kei
  12. Run number assignments (Mark)
  13. Proposed Schedule
    • Launch of Data Challenge Thursday March 6, 2014 (est.).
    • Test jobs going successfully by Tuesday March 4.
    • Distribution ready by Monday March 3.
  14. AOT

Minutes

Present:

  • CMU: Paul Mattione
  • FSU: Aristeidis Tsaris
  • JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Beni Zihlmann
  • MIT: Justin Stevens
  • NU: Sean Dobbs
  • UConn: Richard Jones

Find a recording of the meeting here.

Random Number Seed Status

Mark led us through his wiki page with notes from an interview with David Lawrence. Changes may be coming.

ZFATAL Fix

Richard gave us a few more details about the fix, beyond his email.

There is a maximum of 64,000 pointers for Zebra memory management. We were hitting this limit at high beam rate with EM background turned on. Some time ago Beni introduced a change where secondaries are put on the primary stack in order to track their genealogy. This change has proved useful. At the same time, Beni also exempted particles produced in showers in the calorimeters; there are too many of those and their parentage is generally not of interest. With EM background turned on, showers can also occur in the beam collimators, but until now no exemption was in place for them. Richard put in this needed exemption.

We agreed that this fixes the ZFATAL issue that Paul had reported previously.
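
The exemption logic described above can be pictured with a small sketch. This is only an illustration in Python, not the simulation code itself: the region names and both functions are hypothetical, and the real bookkeeping lives in the GEANT/Zebra layer.

```python
# Illustrative only: region names and function names are hypothetical,
# not taken from the GlueX simulation code.

# Regions whose shower products are NOT put on the primary stack.
# The calorimeter exemption existed already; the collimator exemption
# is the one added as the ZFATAL fix.
EXEMPT_REGIONS = {"calorimeter", "beam_collimator"}


def goes_on_primary_stack(region: str) -> bool:
    """Return True if a secondary born in `region` is kept on the
    primary stack so that its genealogy can be tracked."""
    return region not in EXEMPT_REGIONS


def count_tracked(birth_regions: list[str]) -> int:
    """Count how many secondaries would be placed on the primary stack."""
    return sum(goes_on_primary_stack(r) for r in birth_regions)


if __name__ == "__main__":
    sample = ["drift_chamber", "beam_collimator", "calorimeter", "target"]
    print(count_tracked(sample), "of", len(sample), "secondaries tracked")
```

The point of the sketch is only that fewer stack entries are created once collimator showers are exempted, which keeps the job below the Zebra pointer limit.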

Short File Issue and Non-Reproducible Results

Paul went through his recent email on non-reproducibility of the code. This may or may not be related to the short file issue, but needs addressing in any case.

Chris noted that often results can be non-reproducible due to off-by-one errors, where uninitialized array elements are accessed by mistake.

Justin sent an email to the group in which he concludes that enabling compression of the REST output is indeed correlated with short REST files. Mark confirmed this in his recent running at JLab.

Richard is working on fixing this problem.

Running at CMU

Paul mentioned that the b1pi test is broken again.

He reported on job run times:

photon rate    k events    hours
No EM          50          12
1×10^7         25          24
5×10^7         5           18

We will probably have to cut back on the amount of high intensity running we do.
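
To make the cost of high-intensity running concrete, the three rows of the table above can be converted to time per event. This is a minimal sketch that assumes each row describes a single job and that the hours are the full run time of that job; the function name is ours.

```python
# (label, thousands of events, hours), copied from the table in the minutes.
RUNS = [
    ("No EM", 50, 12),
    ("1e7", 25, 24),
    ("5e7", 5, 18),
]


def seconds_per_event(k_events: float, hours: float) -> float:
    """Average time per event, assuming `hours` covers all `k_events`."""
    return hours * 3600.0 / (k_events * 1000.0)


if __name__ == "__main__":
    for label, k_events, hours in RUNS:
        print(f"{label:>6}: {seconds_per_event(k_events, hours):6.2f} s/event")
```

Under these assumptions the rows work out to roughly 0.86, 3.5, and 13 seconds per event, so the highest rate costs about 15 times more per event than running with no EM background.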

Running at NU

Sean has been running jobs with 20 k events and seeing execution times of 12 to 18 hours. He also sees short REST files.

Running at MIT

More CPUs have been added to the MIT cluster. In addition, Justin has started using nodes on FutureGrid, which also uses OpenStack.

Running at JLab

SciComp will be able to move the nodes that we have been lending to the High Performance Computing (HPC) cluster with a few days' notice. Once we get going we can ask for more nodes from HPC. The plan is to provide us with 1250 cores.

SciComp is very reluctant to allow us to install the SRM at JLab even if we provide the manpower. They would need to review the security properties in order to approve usage and they do not have the manpower to do that in the near term. They are encouraging us to use Globus Online to transfer files in and out of JLab. Richard told us that Globus Online is not appropriate for our use case, in particular doing transfers in batch mode. Chris asked if the Globus Online command-line tool provides the needed functionality, but Richard did not think that it did.

Running at FSU

Aristeidis presented statistics on jobs he has run at FSU. He also reported some problems with building the latest versions of the code. Richard suggested that he take a look at gridmake.

Run Number Assignments

Mark showed recent additions to the data challenge conditions page. He has added run number assignments for the proposed running conditions and file number assignments for the various sites. Richard asked for a larger file number range for the OSG.

We also decided that the number of events in each run (and thus in each file) should depend on the conditions at the individual sites; we will not try to make all files the same size, as we did last time. We will have to add that additional degree of freedom to our bookkeeping.
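
One way to picture the extra bookkeeping is a per-site record that carries both the file number range and the events per file. The sketch below is hypothetical: the class, the helper functions, and every number in the example are made up for illustration and are not the assignments on the conditions page.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SiteAllocation:
    """File number range (inclusive) and events per file for one site."""
    site: str
    first_file: int
    last_file: int
    events_per_file: int

    def n_files(self) -> int:
        return self.last_file - self.first_file + 1


def find_overlaps(allocs: list[SiteAllocation]) -> list[tuple[str, str]]:
    """Return pairs of sites whose file number ranges overlap."""
    return [
        (a.site, b.site)
        for a, b in combinations(allocs, 2)
        if a.first_file <= b.last_file and b.first_file <= a.last_file
    ]


def total_events(allocs: list[SiteAllocation]) -> int:
    """Total events if every assigned file is produced."""
    return sum(a.n_files() * a.events_per_file for a in allocs)


# Hypothetical example values, NOT the real assignments.
EXAMPLE = [
    SiteAllocation("JLab", 1, 1000, 50_000),
    SiteAllocation("CMU", 1001, 2000, 25_000),
    SiteAllocation("OSG", 2001, 12000, 20_000),
]

if __name__ == "__main__":
    print("overlaps:", find_overlaps(EXAMPLE))
    print("total events:", total_events(EXAMPLE))
```

Keeping the events per file next to the range means the total for a run can still be computed even though files from different sites are no longer the same size.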

Next Meeting

We agreed that a week from now we will make a go/no-go decision on whether we are ready to start. If all known problems are solved then there is no issue; if some remain, we will have to discuss whether they are important enough to delay the launch.