GlueX Data Challenge Meeting, December 17, 2012
Monday, December 17, 2012
1:30 pm, EDT
JLab: CEBAF Center, F326/327
- Minutes from last time
- Data Challenge 1 status
- Grid status
- CMU status
- Shutdown plan (or continuation plan?)
- Work list for post DC-1 period
  - file archiving
  - file distribution
- Thoughts on DC-2
  - How much?
To connect from the outside:
- Call ESNET Number 8542553 (this is the preferred connection method).
- Phone: (should not be needed)
- +1-866-740-1260 : US and Canada
- +1-303-248-0285 : International
- then use participant code: 3421244# (the # is needed when using the phone)
- or www.readytalk.com
- then type access code 3421244 into "join a meeting" (you need java plugin)
- CMU: Paul Mattione
- JLab: Mark Ito (chair), David Lawrence, Yi Qiang, Dmitry Romanov, Elton Smith, Simon Taylor, Beni Zihlmann
- UConn: Richard Jones
Data Challenge 1 status
Production started at the three sites Wednesday, December 5, as planned.
We updated progress at the various sites:
- JLab: 678 million events
- Grid: 3.4 billion events
- CMU: 270 million events
See the Data Challenge 1 page for a few more details.
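For a sense of scale, the per-site counts above can be totaled. This is a minimal sketch that uses only the numbers quoted in these minutes; the variable names are illustrative.

```python
# Event counts per site as reported at the meeting.
events = {
    "JLab": 678_000_000,
    "Grid": 3_400_000_000,
    "CMU": 270_000_000,
}

total = sum(events.values())
print(f"Total: {total / 1e9:.2f} billion events")

# Share of the total contributed by each site.
for site, n in events.items():
    print(f"{site}: {100 * n / total:.1f}% of total")
```

The sum comes to roughly 4.35 billion events, with the grid contributing a little under four fifths of the total.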
We ran down some of the problems encountered:
- Much of the time spent getting the grid effort started went to correcting problems. Some jobs that resubmitted themselves after crashing would crash again, so activity reached a state where a majority of the jobs were stuck in this infinite loop and had to be stopped by hand. This was solved by lowering the number of resubmissions allowed.
- There were occasional segmentation faults in hdgeant. Richard is investigating the cause.
- mcsmear would sometimes hang. David and Richard traced this to the processing thread taking more than 30 seconds on an event, then killing and re-launching itself without releasing the mutex lock for the output file.
- Re-running the job fixed this problem because mcsmear was seeded differently each time.
- The lock-release problem will be fixed.
- We have to find out why it can take more than 30 seconds to smear an event.
- The default behavior should be changed to a hard crash. Re-launching threads could still be retained as an option.
- At JLab some jobs would not produce output files, but would only end after exceeding the job CPU limit.
- Also at JLab, some of the REST format files did not have the full 50,000 events.
- There may be other failure modes that we have not cataloged. We will at least try to figure out what happened with all failures.
- At the start of the grid effort the submission node crashed. It was replaced with a machine with more memory, which solved the problem. We peaked at 7,000 grid jobs running simultaneously. This was about 10% of the total grid capacity.
- Another host for the grid system, the user scheduler which maintains a daemon for each job, also needed more memory to function under this load.
- The storage resource manager (SRM), which transfers the output files back to UConn in this case, was very reliable. The gigabit pipe back to UConn was essentially filled during this effort.
- Richard thought that next time we should do 100 million events and then go back and debug the code. Mark reminded us that the thinking was that the failure rate was low enough to do useful work and that it was more important to get the data challenge going and learn our lessons, since we will have other challenges in the future. [Note added in press: coincidentally, 100 million was the size of our standard mini-challenge. Folks will recall that those challenges started out with unacceptable failure rates and iterated to iron out the kinks.]
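The fix for the resubmission loop described above, capping how many times a crashed job may resubmit itself, can be sketched as follows. This is an illustration only, not the actual grid submission logic; `run_with_retries`, `max_resubmits`, and `always_crashes` are hypothetical names.

```python
def run_with_retries(job, max_resubmits=3):
    """Run job(); after a crash, resubmit at most max_resubmits times.

    Returns (succeeded, attempts), where attempts counts the initial
    submission plus any resubmissions.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            job()
            return True, attempts
        except Exception:
            # Give up once the resubmission budget is spent, so a job
            # that always crashes cannot loop forever.
            if attempts > max_resubmits:
                return False, attempts


def always_crashes():
    # Stand-in for a job that fails deterministically on every attempt.
    raise RuntimeError("simulated crash")


ok, attempts = run_with_retries(always_crashes, max_resubmits=2)
print(ok, attempts)
```

With a small cap, a job that fails on every attempt is abandoned after a bounded number of tries instead of occupying a slot indefinitely.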
Curtis sent around an email with his assessment of our status and where he thinks we should go from here. Most notably, he suggests we write a report on DC-1.
There was consensus that, given that we have already exceeded our original goals by over a factor of two, we should stop submitting more jobs and assess where we are. The expectation is that currently submitted jobs will run out in a day or two.
Work list for post DC-1 period
- We decided that we would archive all files to the JLab tape library: REST files, ROOT files, and log files. Details have to be worked out, but we should do this right away.
- To distribute the data, we will move all of the REST data to UConn and make it available via the SRM. Note that most of the data is at UConn already anyway.
- We will also try to have all of the REST data on disk at JLab.
- We should look into SURA grid and see if we have any claim on its resources.
- Paul suggested doing skims of selected topologies for use by individuals doing specific analyses. Those interested in particular types of events should think about making proposals.
- Richard suggested we develop a Jana plug-in to read data using the SRM directly. Only the URL would have to be known, and data could be streamed in.
- To enable general access to the data, we decided that we all get grid certificates, i.e., obtain credentials for the entire collaboration. Richard will send instructions on how to get started with this.
- Problems to address:
- seg faults in hdgeant
- hangs in mcsmear
- random number seed control
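One possible approach to random number seed control is to derive each job's seed deterministically from its identifiers, so that re-running a job reproduces the same sequence (and therefore the same failure). The sketch below is an assumption for illustration, not how hdgeant or mcsmear handle seeds; `job_seed`, `run_number`, `file_number`, and `base_seed` are hypothetical names.

```python
import hashlib
import random


def job_seed(run_number, file_number, base_seed=0):
    """Derive a reproducible 32-bit seed from job identifiers."""
    key = f"{base_seed}:{run_number}:{file_number}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the hash as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


# The same identifiers always give the same seed, hence the same
# random sequence on every re-run of that job.
seed = job_seed(run_number=9000, file_number=17)
rng_first = random.Random(seed)
rng_rerun = random.Random(job_seed(run_number=9000, file_number=17))
print(rng_first.random() == rng_rerun.random())
```

Recording the seed in the job log would serve the same purpose if seeds continue to be chosen differently on each run.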
Thoughts on DC-2
We need to start thinking about the next data challenge, in particular, goals and schedule.