GlueX Data Challenge Meeting
Friday, March 7, 2014
11:00 am, EST
JLab: CEBAF Center, F326
ESNet: 8542553
SeeVogh: http://research.seevogh.com
ReadyTalk: (866)740-1260, access code: 1833622
ReadyTalk desktop: http://esnet.readytalk.com/, access code: 1833622

Agenda

  1. Announcements
  2. Review of minutes from last time
  3. Random Number Seeds Procedure (as of 2014-03-07)? (Mark/Anyone)
  4. ZFATAL fix (Richard)
  5. Short file issue (all)
    • non-reproducible results (https://mailman.jlab.org/pipermail/halld-offline/2014-March/001536.html)
  6. Running jobs at CMU (Paul)
  7. Running jobs at NU (Sean)
  8. Running jobs at MIT (Justin)
  9. Running jobs at JLab (Mark)
    1. Nodes at JLab
    2. SRM update
  10. Running jobs at FSU (Aristeidis)
  11. Electromagnetic Backgrounds update (Paul/Kei)
  12. Run number assignments (Mark)
  13. Proposed Schedule
    • Launch of Data Challenge Thursday March 6, 2014 (est.).
    • Test jobs going successfully by Tuesday March 4.
    • Distribution ready by Monday March 3.
  14. AOT

Minutes

Present:

  • CMU: Paul Mattione
  • FSU: Aristeidis Tsaris
  • JLab: Mark Dalton, Mark Ito (chair), Chris Larrieu, Beni Zihlmann
  • MIT: Justin Stevens
  • NU: Sean Dobbs
  • UConn: Richard Jones

Find a recording of the meeting here: https://halldweb.jlab.org/talks/2014-1Q/data_challenge_2014-03-07/index.htm

Random Number Seed Status

Mark led us through his wiki page, Random Number Seeds Procedure (as of 2014-03-07), with notes from an interview with David Lawrence. Changes may be coming.

ZFATAL Fix

Richard gave us a few more details about the fix, beyond his email (https://mailman.jlab.org/pipermail/halld-offline/2014-March/001527.html).

There is a maximum of 64,000 pointers for Zebra memory management. We were exceeding this limit at high beam rates with EM background turned on. Some time ago, Beni introduced a change where secondaries are put on the primary stack in order to track their genealogy; this change has proved useful. At the same time, he exempted particles produced in showers in the calorimeters, since there are too many of those and their parentage is generally not of interest. With EM background turned on, showers can also occur in the beam collimators, but until now no exemption was in place for them. Richard put in this needed exemption.
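
As described, the fix amounts to a parentage-tracking predicate with per-volume exemptions for shower products. A minimal C++ sketch of that idea follows; it is illustrative only, and every name in it is hypothetical rather than taken from the actual HDGeant/Zebra code.

  // Illustrative sketch only -- not the actual HDGeant/Zebra code.
  // All names (Secondary, Volume, shouldTrackParentage) are hypothetical.
  // The idea: secondaries normally go on the primary stack so their
  // genealogy is preserved, but shower products in high-multiplicity
  // volumes are exempted to keep the number of Zebra-managed pointers
  // below the ~64,000 limit.

  enum class Volume { Calorimeter, BeamCollimator, Other };

  struct Secondary {
      Volume producedIn;
      bool   fromShower;
  };

  bool shouldTrackParentage(const Secondary& s)
  {
      if (s.fromShower) {
          // Beni's original exemption: calorimeter showers produce far too
          // many secondaries whose parentage is generally not of interest.
          if (s.producedIn == Volume::Calorimeter) return false;
          // Richard's fix: with EM background on, collimator showers also
          // flood the stack, so they get the same exemption.
          if (s.producedIn == Volume::BeamCollimator) return false;
      }
      return true; // everything else keeps its genealogy
  }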

We agreed that this fixes the ZFATAL issue that Paul had reported previously.

Short File Issue and Non-Reproducible Results

Paul went through his recent email on non-reproducibility of the code (https://mailman.jlab.org/pipermail/halld-offline/2014-March/001536.html). This may or may not be related to the short file issue, but it needs addressing in any case.

Chris noted that results can often be non-reproducible due to off-by-one errors, where uninitialized array elements are accessed by mistake.
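
A toy C++ example of the failure mode Chris described (not taken from the GlueX code): an off-by-one loop bound reads one element past the initialized range, so the result depends on whatever happened to be in memory.

  #include <iostream>

  int main()
  {
      double hits[4];                  // hits[3] is never initialized
      for (int i = 0; i < 3; ++i)      // only elements 0..2 are filled
          hits[i] = 1.0;

      double sum = 0.0;
      for (int i = 0; i <= 3; ++i)     // BUG: <= reads uninitialized hits[3]
          sum += hits[i];

      std::cout << sum << '\n';        // undefined behavior: the value may
      return 0;                        // differ from one run to the next
  }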

Justin sent out an email to the group (https://mailman.jlab.org/pipermail/halld-offline/2014-March/001532.html) in which he concludes that enabling compression of the REST output is indeed correlated with short REST files. Mark confirmed this in his recent running at JLab.

Richard is working on fixing this problem.

Running at CMU

Paul mentioned that the b1pi test is broken again.

He reported on job run times:

  photon rate (γ/s)    events per job (thousands)    run time (hours)
  no EM background     50                            12
  1×10⁷                25                            24
  5×10⁷                5                             18

We will probably have to cut back on the amount of high-intensity running we do.
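
For scale, the implied throughput works out as follows; this is a quick calculation from the table above, with no other assumptions.

  #include <cstdio>

  int main()
  {
      // events per job (thousands) and hours per job, from Paul's table
      struct { const char* rate; double kEvents, hours; } jobs[] = {
          {"no EM", 50.0, 12.0},
          {"1e7",   25.0, 24.0},
          {"5e7",    5.0, 18.0},
      };
      for (const auto& j : jobs)
          std::printf("%-6s : %6.0f events/hour\n",
                      j.rate, 1000.0 * j.kEvents / j.hours);
      // no EM : ~4167, 1e7 : ~1042, 5e7 : ~278 -- roughly a factor of 15
      // between no background and 5x10^7, hence the need to cut back.
      return 0;
  }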

Running at NU

Sean has been running jobs with 20 k events and seeing execution times of 12 to 18 hours. He also sees short REST files.

Running at MIT

More CPUs have been added to the MIT cluster. In addition, Justin has started using nodes on FutureGrid, which also uses OpenStack.

Running at JLab

SciComp will be able to move the nodes that we have been lending to the High Performance Computing (HPC) cluster with a few days' notice. Once we get going we can ask for more nodes from HPC. The plan is to provide us with 1250 cores.

SciComp is very reluctant to allow us to install the SRM at JLab even if we provide the manpower. They would need to review the security properties in order to approve usage and they do not have the manpower to do that in the near term. They are encouraging us to use Globus Online to transfer files in and out of JLab. Richard told us that Globus Online is not appropriate for our use case, in particular doing transfers in batch mode. Chris asked if the Globus Online command-line tool provides the needed functionality, but Richard did not think that it did.

Running at FSU

Aristeidis presented statistics on jobs he has run at FSU (http://hadron.physics.fsu.edu/~aristeidis/offline_challenge.pdf). He also reported some problems with building the latest versions of the code. Richard suggested that he take a look at gridmake.

Run Number Assignments

Mark showed recent additions to the data challenge conditions page (https://halldweb.jlab.org/data_challenge/02/conditions/data_challenge_2.html). He has added run number assignments to the proposed running conditions and file number assignments for the various sites. Richard asked for a larger file number range for the OSG.

We also decided that the number of events in each run (and thus in each file) should depend on the conditions at the individual sites; we will not try to make all files the same size, as we did last time. We will have to add that additional degree of freedom to our bookkeeping.
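
One way to picture the extra degree of freedom: each site's allocation now carries its own events-per-file value alongside its run and file number ranges, rather than a single global file size. The C++ sketch below is purely illustrative; all names and numbers in it are hypothetical, and the real assignments live on the conditions page.

  #include <map>
  #include <string>

  struct SiteAllocation {
      int firstRun, lastRun;     // run number range assigned to the site
      int firstFile, lastFile;   // file number range within each run
      int eventsPerFile;         // now site-dependent, not global
  };

  // Hypothetical values, for illustration only.
  std::map<std::string, SiteAllocation> allocations = {
      {"JLab", {9001, 9100, 0,  999, 50000}},
      {"OSG",  {9101, 9300, 0, 1999, 25000}},
      {"CMU",  {9301, 9350, 0,  499, 20000}},
  };

  // Bookkeeping must now fold eventsPerFile into per-site totals.
  long long totalEvents(const SiteAllocation& a)
  {
      long long runs  = a.lastRun  - a.firstRun  + 1;
      long long files = a.lastFile - a.firstFile + 1;
      return runs * files * a.eventsPerFile;
  }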

Next Meeting

We agreed that one week from now we will make a go/no-go decision on whether we are ready to start. If all known problems are solved, there is no issue; if some remain, we will have to discuss whether they are important enough to delay the launch.