GlueX Offline Software Meeting
Wednesday, July 6, 2016
1:30 pm EDT
JLab: CEBAF Center F326/327

Agenda

  1. Announcements
    1. Intel Lustre upgrade (Mark)
    2. New release: sim-recon 2.1.0 (Mark)
    3. REST backwards compatibility now broken (Paul)
    4. Raw data copy to cache (David)
    5. New hdpm "install" command (Nathan)
    6. New wiki docs for hddm (Richard)
    7. Other announcements?
  2. Review of minutes from June 8 (all)
  3. Spring 2016 Run Processing Status (Paul, Alex)
    • Distributing REST files from initial launch (Matt): https://mailman.jlab.org/pipermail/halld-offline/2016-June/002390.html
    • Launch Stats (Alex): https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html
    • REST file I/O (Mike): https://halldweb.jlab.org/wiki/images/c/c0/RestRates6Jul2016.pdf
  4. simX.X (Sean)
    • sim1.1 Conditions: https://github.com/JeffersonLab/gluex_simulations/tree/master/sim1.1
  5. mcsmear and CCDB variation setting (David)
  6. ROOT 6 upgrade? (Mark)
  7. Review of recent pull requests (all)
  8. Review of recent discussion on the GlueX Software Help List
  9. Action Item Review

Communication Information

Remote Connection

Slides

Talks can be deposited in the directory /group/halld/www/halldweb/html/talks/2016 on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2016/.

Minutes

You can view a recording of this meeting (https://bluejeans.com/s/9Xee/) on the BlueJeans site.

Present:

  • CMU: Naomi Jarvis, Curtis Meyer, Mike Staib
  • FSU: Brad Cannon
  • GSI: Nacer Hamdi
  • JLab: Alexander Austregesilo, Alex Barnes, Mark Ito (chair), David Lawrence, Paul Mattione, Justin Stevens, Simon Taylor
  • NU: Sean Dobbs
  • Regina: Tegan Beattie
  • UConn: Richard Jones

Announcements

  1. Intel Lustre upgrade. Mark reminded us about the upgrade (https://mailman.jlab.org/pipermail/jlab-scicomp-briefs/2016q2/000126.html) done a few weeks ago. He spoke with Dave Rackley earlier today, July 6, 2016.
    • An Intel version of Lustre was installed on the servers. Call support is available; we pay for it.
    • There were hangs after the upgrade. The Intel Lustre client was then installed on the ifarms, and there have been no incidents since. Installs are still rolling out to the farm and HPC nodes.
    • Please report issues if they are encountered.
  2. New release: sim-recon 2.1.0. This release (https://mailman.jlab.org/pipermail/halld-offline/2016-June/002387.html) came out about a month ago. A new release should arrive this week.
  3. REST backwards compatibility now broken. Paul's email (https://mailman.jlab.org/pipermail/halld-physics/2016-May/000675.html) describes the situation: old REST files cannot be read with new sim-recon code.
  4. Raw data copy to cache. After some discussion with the Computer Center (see https://halldweb.jlab.org/talks/2016/raw_data_to_cache.pdf), the first files of each run will now appear on the cache disk without having to be fetched from the Tape Library.
  5. New HDPM "install" command. Nathan Sparks explains it in his email (https://mailman.jlab.org/pipermail/halld-offline/2016-May/002369.html); it replaces the "fetch-dist" command.

New wiki documentation for HDDM

Richard led us through his new wiki page (announced at https://mailman.jlab.org/pipermail/halld-offline/2016-July/002408.html), which consolidates and updates the documentation for the HDDM package. A new feature is a Python API for HDDM. Here is the table of contents:

    1 Introduction
    2 Templates and schemas
    3 How to get started
    4 HDDM in python
        4.1 writing hddm files in python
        4.2 reading hddm files in python
        4.3 advanced features of the python API
    5 HDDM in C++
        5.1 writing hddm files in C++
        5.2 reading hddm files in C++
        5.3 advanced features of the C++ API
    6 HDDM in c
        6.1 writing hddm files in c
        6.2 reading hddm files in c
        6.3 advanced features of the c API
    7 Advanced features
        7.1 on-the-fly compression/decompression
        7.2 on-the-fly data integrity checks
        7.3 random access to hddm records
    8 References

Some notes from the discussion:

  • If a lot of sparse, single-event access is anticipated, the zip format may be better because of its smaller buffer size. Bzip2 is now the default.
  • The random-access feature provides "bookmarks" for individual events that can be saved and used for quick access later, even for compressed files.
  • The Python API can be used in conjunction with PyROOT to write ROOT tree generators that take any HDDM file as input, quickly and economically (see the sketch below).
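
As an illustration of that last bullet, here is a minimal sketch of such a tree generator. It is not taken from Richard's documentation: the module name hddm_s, its istream class, the getPhysicsEvents() accessor, and the eventNo attribute are assumptions patterned on the C++ API, so the actual names should be checked against the wiki page. The PyROOT calls (TFile, TTree, Branch, Fill) are standard.

    # Sketch only: convert an HDDM file into a flat ROOT tree via the Python API.
    # Assumed names: hddm_s (schema-generated module), istream, getPhysicsEvents(),
    # eventNo.  Check Richard's wiki page for the real interface.
    from array import array

    import ROOT
    import hddm_s  # hypothetical schema-generated module name

    fout = ROOT.TFile("events.root", "RECREATE")
    tree = ROOT.TTree("events", "event numbers from an HDDM file")

    evno = array("i", [0])
    tree.Branch("eventNo", evno, "eventNo/I")

    for record in hddm_s.istream("sample.hddm"):   # iterate over HDDM records
        for pe in record.getPhysicsEvents():       # accessor name assumed
            evno[0] = pe.eventNo                   # attribute name assumed
            tree.Fill()

    fout.Write()
    fout.Close()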

REST file I/O

Mike described a throughput limit he has seen for compressed REST data compared with uncompressed data. See his slides (https://halldweb.jlab.org/wiki/images/c/c0/RestRates6Jul2016.pdf) for plots and details. When reading compressed data, the single-threaded HDDM reader limits scaling with the number of event-analysis threads; the curve turns over at about 6 or 7 threads. On the other hand, compressed data puts less load on disk-read bandwidth, so multiple jobs contending for that bandwidth might do better with compressed data.
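
A back-of-the-envelope model shows why a single reader thread caps the scaling. The per-event times below are invented for illustration only; they are not taken from Mike's measurements.

    # Toy model of the scaling limit imposed by a single-threaded HDDM reader.
    # Throughput can never exceed 1/t_read, no matter how many analysis threads run.
    t_read = 0.5e-3     # seconds/event in the single reader thread (invented)
    t_analyze = 3.0e-3  # seconds/event of analysis work, shared by N threads (invented)

    for n_threads in (1, 2, 4, 6, 8, 12, 16):
        rate = 1.0 / max(t_read, t_analyze / n_threads)   # events/s
        print("%2d threads: %6.0f events/s" % (n_threads, rate))

    # With these numbers the rate grows linearly up to about 6 threads and is flat
    # afterwards, qualitatively the turn-over Mike observed for compressed input.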

Richard agreed to buffer input and launch a user-defined number of threads to do HDDM input. That should prevent starvation of the event analysis threads.
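
Schematically, the plan amounts to a bounded buffer fed by dedicated reader threads, with the analysis threads consuming from it. The sketch below only illustrates that producer/consumer pattern with the Python standard library; Richard's actual implementation will live inside the HDDM stream classes.

    # Producer/consumer illustration of decoupling HDDM input from event analysis.
    # This is a pattern sketch, not the planned HDDM code; the real version would
    # use a user-defined number of reader threads inside the library.
    import queue
    import threading

    buf = queue.Queue(maxsize=1000)   # bounded buffer of decoded events
    DONE = object()                   # sentinel marking end of input

    def reader(events):
        """Decompress/parse events and push them into the buffer."""
        for ev in events:
            buf.put(ev)
        buf.put(DONE)

    def analyzer():
        """Pull events from the buffer and process them."""
        while True:
            ev = buf.get()
            if ev is DONE:
                buf.put(DONE)         # pass the sentinel on to the other analyzers
                break
            # ... event analysis would go here ...

    workers = [threading.Thread(target=analyzer) for _ in range(4)]
    for t in workers:
        t.start()
    reader(range(100000))             # stand-in for the HDDM input stream
    for t in workers:
        t.join()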

Review of minutes from June 8

We went over the minutes.

  • Small files are still being retained on the cache disk, without automatic archiving to tape. Mark will repeat his plea for small file deletion soon.
    • Alex A. pointed out that it is now possible to pin small files and to force a write to tape. That was not the case a couple of weeks ago.
    • Sean reminded us that we had put in a request for a get-and-pin command from jcache. Mark will check on status.
  • RCDB is now fully integrated into the build_scripts system. It is built on the JLab CUE on nodes where C++11 features are supported. You can now make RCDB C++ API calls in sim-recon plugins, and SCons will do the right thing at build time as long as you have the RCDB_HOME environment variable defined properly (a schematic example follows this list).
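
For illustration only, and not the actual build_scripts logic: an SCons environment can pick up the RCDB C++ API from RCDB_HOME roughly as follows. The cpp/include and cpp/lib layout, the rcdb library name, and the my_plugin_test target are all assumptions.

    # SConstruct fragment (hypothetical): add RCDB paths when RCDB_HOME is set.
    import os

    env = Environment()
    rcdb_home = os.environ.get("RCDB_HOME")
    if rcdb_home:
        env.Append(CPPPATH=[os.path.join(rcdb_home, "cpp", "include")],   # assumed layout
                   LIBPATH=[os.path.join(rcdb_home, "cpp", "lib")],       # assumed layout
                   LIBS=["rcdb"])                                         # assumed library name
    else:
        print("RCDB_HOME not set; building without RCDB support")

    env.Program("my_plugin_test", ["my_plugin_test.cc"])                  # hypothetical target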

Spring 2016 Run Processing Status

Distributing REST files from initial launch

Richard, Curtis, and Sean commented on the REST file distribution process. Matt Shepherd copied "all" of the files from JLab to IU and has pushed them to UConn, CMU, and Northwestern, as per his proposal (https://mailman.jlab.org/pipermail/halld-offline/2016-June/002390.html), using Globus Online. He was able to get about 10 MB/s from JLab to IU. Similar speeds, within factors of a few, were obtained in the university-to-university transfers. All cautioned that one needs to think carefully about network and networking-hardware configurations to get acceptable bandwidth.
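
For a sense of scale at the ~10 MB/s Matt observed, a quick estimate (the 5 TB figure is purely illustrative; it is not the actual size of the REST data set):

    # Rough wide-area transfer-time estimate at the observed rate.
    rate_mb_per_s = 10.0   # observed JLab -> IU rate
    size_tb = 5.0          # illustrative data-set size only, NOT the real number

    seconds = size_tb * 1.0e6 / rate_mb_per_s   # 1 TB = 1e6 MB
    print("%.0f TB at %.0f MB/s takes about %.1f days" %
          (size_tb, rate_mb_per_s, seconds / 86400.0))
    # -> roughly 5.8 days at 10 MB/s; a poorly tuned link can be several times slower.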

Alex A. cautioned us that some small files in Batch 1, and to a lesser extent in Batch 2, either were lost before being archived to tape, or are in the Tape Library but were not pinned and have disappeared from the cache disk.

Launch Stats

Alex pointed us to the Launch Stats webpage (https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html), which now contains links to the statistics pages for the full reconstruction launches. We looked at the page for Batch 01 (https://halldweb.jlab.org/data_monitoring/recon/summary_swif_output_recon_2016-02_ver01_batch01.html).

  • The page shows statistics on jobs run.
  • We discussed the plot of the number of jobs at each state of farm processing as a function of time. For the most part we were limited by the number of farm nodes, but there were times when we were waiting for raw data files from tape.
  • We never had more than about 500 jobs running at a time.
  • Memory usage was about 7 GB for Batch 1, a bit more for Batch 2.
  • The jobs ran with 14 threads.
  • One limit on farm performance was CLAS jobs that required so much memory that farm nodes were left running with large fractions of their cores idle.

mcsmear and CCDB variation setting

David noticed last week that he had to choose the mc variation of CCDB to get sensible results from the FCAL when running mcsmear. This was because of a change in the way the non-linear energy correction was being applied. He asked whether we want to make this variation the default for mcsmear, since mcsmear is only run on simulated data.

The situation is complicated by the fact that not all simulated data should use the mc variation. That is only appropriate for getting the "official" constants intended for simulating data already in the can. Note that if no variation is specified at all, then the default variation is used; that was the problem that David discovered.

After some discussion, we decided to ask Sean to add a warning to mcsmear if no variation is named at all.
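
For reference, a sketch of the kind of check being requested, assuming (as with other Hall D tools) that the CCDB variation is selected through the JANA_CALIB_CONTEXT environment variable, e.g. variation=mc; the exact mechanism inside mcsmear should be confirmed before relying on this.

    # Sketch of the requested safeguard: warn when no CCDB variation is specified.
    # Assumes the variation is carried in JANA_CALIB_CONTEXT (e.g. "variation=mc");
    # the real check would live inside mcsmear itself.
    import os
    import sys

    context = os.environ.get("JANA_CALIB_CONTEXT", "")
    if "variation=" not in context:
        sys.stderr.write("WARNING: no CCDB variation specified; the default "
                         "variation will be used.  For official simulation "
                         "constants set, e.g., JANA_CALIB_CONTEXT='variation=mc'.\n")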

ROOT 6 upgrade?

Mark has done a test build of a recent version of our software with ROOT 6. We had said that we should transition from ROOT 5 to 6 once we had established use of a C++11-compliant compiler; that has now been done.

Paul pointed out that the change may break some ROOT macros used by individuals, including some used for calibration. On the other hand, the change has to happen at some point.

Mark told us he will not make the change for the upcoming release, but will consider it for the one after that. In any case, we will discuss it further.