GlueX Offline Meeting, May 25, 2016

GlueX Offline Software Meeting
Wednesday, May 25, 2016
1:30 pm EDT
JLab: CEBAF Center F326/327

Agenda

Announcements
1. Write-through Cache
2. Other announcements?
Review of minutes from April 27 (all)
Copying data to off-site locations (all)
Disk space requirements for Spring 2016 data (all)
- Current EVIO skim sizes
Calibration Challenge/Processing (Sean)
Spring 2016 Run, Processing Plans
1. Metadata for this processing launch (GlueX note) (Sean)
Negative Parity in Tracking code (David)
sim1.1? (all)
HDGeant4 news (Mark)
Review of recent pull requests (all)
Action Item Review

Communication Information

Remote Connection

The BlueJeans meeting number is 968 592 007.
Join the Meeting via BlueJeans

Slides

Talks can be deposited in the directory /group/halld/www/halldweb/html/talks/2016 on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2016/ .

Minutes

You can view a recording of this meeting on the BlueJeans site.

Present:

CMU: Curtis Meyer
FIU: Mahmoud Kamel
IU: Matt Shepherd
JLab: Alexander Austregesilo, Amber Boehnlein, Graham Heyes, Mark Ito (chair), David Lawrence, Paul Mattione, Sandy Philpott, Nathan Sparks, Justin Stevens, Adesh Subedi, Simon Taylor, Chip Watson
MIT: Christiano Fanelli
NU: Sean Dobbs

Review of minutes from April 27

Reminder: we switch to GCC version 4.8 or greater on June 1.
Mark raised the question of whether corrections to data values in the RCDB should properly be reported as a software issue on GitHub. No alternative forum was proposed. We left it as something to think about.

Announcement: Write-Through Cache

Mark reminded us about Jie Chen's email announcing the switch-over to the write-through cache, which replaces the read-only cache. The change moves toward making operations symmetric between LQCD and ENP.
Mark cautioned us that we should treat this as an interface to the tape library, and not as an infinite "work" disk. This means that we should not write many small files to it, and that we need to think carefully about the directory structure in the /mss tree that will result from new directories created on the cache disk.
Chip pointed out that the write-through cache facilitates writing data from off-site to the tape library by allowing a pure disk access to do the job.
The size threshold for writing to tape will be set to zero. Other parameters will be documented on the SciComp web pages.
At some point we need to do a purge of small files already existing on the cache. These will not be a problem for a few months, so we have some time to get around to it.
There is an ability to write data immediately to tape and optionally delete it from the cache.
At some point in the future, Paul will give us a talk on best practices for using the write-through cache.
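As a starting point for the small-file purge mentioned above, the following is a minimal sketch of how one might inventory small files under a directory tree before deciding what to remove. The function names and the size threshold are hypothetical; the minutes do not specify what counts as "small", and nothing here deletes anything.

```python
import os


def find_small_files(root, threshold_bytes):
    """Yield (path, size) for regular files under root smaller than threshold_bytes.

    threshold_bytes is an assumption of this sketch, not a documented
    parameter of the write-through cache.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symbolic links, report real files only
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if size < threshold_bytes:
                yield path, size


def summarize(found):
    """Return (number of files, total bytes) for a list of (path, size) pairs."""
    return len(found), sum(size for _path, size in found)


# Example use (the directory and the 1 MB threshold are placeholders):
#   found = list(find_small_files("/path/to/cache/area", 1_000_000))
#   count, total = summarize(found)
```

The listing only reports candidates; any actual cleanup would presumably follow whatever procedure is recommended for the cache.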

Copying data to off-site locations

Data Copying

We went over the recent email thread centered around bandwidth for transferring data off-site (expected bandwidth offsite).

Matt had started the thread by asking what the maximum possible transfer rate might be.
Concern was expressed that if we truly wanted to go as fast as possible, that might cause back-ups on-site for data transfers related to farm operation.
Numbers from Chip:
- network pipe to the outside: 10 Gbit/s
- tape bandwidth: 2 GByte/s
- disk access: 10-20 GByte/s
If collaborators were to try to pull data (indirectly) from tape in a way that saturated the network going out, that could use roughly half of the tape bandwidth. However, that scenario is not very likely. The interest is in REST data, and those data are produced at a rate much, much less than the full bandwidth of the tape system. We had agreed that an average rate of 100 MB per second was sufficient for the collaboration and easily provided given the current configuration at the Lab. Special measures to throttle use will probably not be necessary.
Matt offered resources that he gets nearly for free from IU. These could be used to create an off-site staging site from which other collaborating institutions can draw the data. This would off-load much of the demand from the Lab, potentially by a factor of several.
These points were largely settled in the course of the email discussion.
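The arithmetic behind the numbers above can be checked with a short sketch. It uses only the figures quoted in these minutes, assumes decimal units (1 GByte = 1000 MByte, 1 TByte = 10^6 MByte) and ignores protocol overhead; the function names are illustrative.

```python
# Figures quoted in the minutes (decimal units assumed).
NETWORK_GBIT_PER_S = 10.0    # network pipe to the outside
TAPE_GBYTE_PER_S = 2.0       # tape bandwidth
AGREED_MBYTE_PER_S = 100.0   # agreed average off-site transfer rate


def gbit_to_gbyte(gbit):
    """Convert gigabits to gigabytes (8 bits per byte)."""
    return gbit / 8.0


def network_fraction_of_tape():
    """Fraction of tape bandwidth used if the outgoing network is saturated."""
    return gbit_to_gbyte(NETWORK_GBIT_PER_S) / TAPE_GBYTE_PER_S


def agreed_fraction_of_network():
    """Fraction of the outgoing network used by the agreed average rate."""
    return (AGREED_MBYTE_PER_S / 1000.0) / gbit_to_gbyte(NETWORK_GBIT_PER_S)


def tbyte_per_day(mbyte_per_s):
    """Data volume in TByte moved in one day at a sustained rate in MByte/s."""
    return mbyte_per_s * 86400 / 1e6
```

Under these assumptions a saturated 10 Gbit/s link carries 1.25 GByte/s, a bit more than half of the 2 GByte/s tape bandwidth, while the agreed 100 MByte/s average is 8% of the link and corresponds to about 8.6 TByte per day.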

Discussion with OSG Folks

Amber, Mark, and Richard Jones had a video-conference on Monday (May 23) with Frank Wuerthwein and Rob Gardner of the Open Science Grid to discuss strengthening OSG-related capabilities at JLab.

Frank proposed working on two issues: job submission from JLab and data transfer in and out of the Lab using OSG resources and tools. Initially he proposed getting the job submission going first, but after hearing about our need to ship REST data off-site, modified the proposal to give the efforts equal weight.
Richard was able to fill in the OSG guys on what was already in place and what has been done in the past with the GlueX Virtual Organization (VO).
For data transfer, XROOTD was identified as a likely technology to use despite the fact that REST data is not ROOT based.
Rob and Frank will go away and think about firming up details of the proposal in both areas given their understanding of our requirements. They will get back to us when they have a formulation.

Future Practices for Data Transfer

Amber encouraged us to think about long-term solutions to the data transfer problem, rather than focusing exclusively on the "current crisis". In particular we should be thinking about solutions that (a) are appropriate for the other Halls and (b) have robust support in the HENP community generally. The LHC community has had proven success in this area and we are in a good position to leverage their experience.

To do this will require more detailed specification of requirements than we have provided thus far as well as communication of such requirements among the Halls. With this planning, possible modest improvements in infrastructure, as well as on-site support of this functionality, can proceed with confidence.

We all agreed that this was a pretty good idea.

Disk space requirements for Spring 2016 data

Sean discussed the disk space being taken up by skims from the recent data run. See his table for the numbers. [The FCAL and BCAL column headings should be reversed.]

Currently, all of the pair spectrometer skims are pinned. Files need to be present on the disk before a Globus Online request for copy can be made against them. Richard is trying to get them all to UConn.
The FCAL and BCAL skims have likely been wiped off the disk by the auto-deletion algorithm.
- Both can be reduced in size by dropping some of the tags. Sean has code to do that, but it needs more testing before being put into production.
The total size of all skims is about 10% of the raw data.

We discussed various aspects of managing our Lustre disk space. It looks like we will need about 200 TB of space for this summer, counting volatile, cache, and work. The system is new to us, and we have some learning to do before we can use it efficiently.

Spring 2016 Run, Processing Plans

Paul noted that this was discussed fully at yesterday's Analysis Meeting. In summary we will produce:

reconstructed data
skims of raw data
TTrees
EventStore meta-data

In addition, job submission scripts are in need of some re-writing.

Meta-data for this processing launch

Sean presented a system for classifying and documenting key aspects of the various data sets that we will have to handle. He guided us through a web page he put together that displays the information. It has a pull-down menu to choose the run period being queried. There is a legend at the bottom that describes each of the fields.

This system has already been in use for the monitoring launch, and the information is used in the monitoring web pages to navigate the different versions of those launches. The data will also be used to correlate data sets with the EventStore. Sean is proposing that the data version string be written into each REST file so that there is a two-way link between the EventStore metadata and the data itself.

Mark suggested that we might want to have a more formal relational database structure for the information. This would require some re-writing of a working system but may be worth the effort.
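To make the suggestion concrete, here is a minimal sketch of what a relational layout for data-version information might look like, using SQLite from Python. The table name, the column names, and the format of the version string are all hypothetical and are not taken from Sean's system or his GlueX Note; the sketch only illustrates a lookup from a version string back to the attributes of a data set.

```python
import sqlite3


def create_schema(conn):
    """Create a single illustrative table; all names are hypothetical."""
    conn.execute(
        """
        CREATE TABLE data_version (
            id             INTEGER PRIMARY KEY,
            run_period     TEXT    NOT NULL,
            data_type      TEXT    NOT NULL,
            revision       INTEGER NOT NULL,
            version_string TEXT    NOT NULL UNIQUE,
            UNIQUE (run_period, data_type, revision)
        )
        """
    )


def add_version(conn, run_period, data_type, revision):
    """Insert one data set version and return its (hypothetical) version string."""
    version_string = f"{data_type}_{run_period}_ver{revision:02d}"
    conn.execute(
        "INSERT INTO data_version (run_period, data_type, revision, version_string) "
        "VALUES (?, ?, ?, ?)",
        (run_period, data_type, revision, version_string),
    )
    return version_string


def lookup(conn, version_string):
    """Return (run_period, data_type, revision) for a version string, or None."""
    return conn.execute(
        "SELECT run_period, data_type, revision FROM data_version "
        "WHERE version_string = ?",
        (version_string,),
    ).fetchone()
```

A version string stored in a data file could then be resolved with `lookup`, which is one way the two-way link described above might be realized; the uniqueness constraints keep two data sets from sharing a string.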

Sean has written a GlueX Note motivating and documenting the system.

Negative Parity in Tracking code

David is working on new code to parse the EVIO data. In the course of that work he was comparing results between the old and new parser and noticed some small differences. He tracked these down to several issues. Some of them have to do with the new parser presenting the hits to the reconstruction code in a different order than that presented by the old parser. See his slides for all of the details.

In summary the issues were:

In the track finding code, there is an assumption about the order of hits. That order was never enforced in the code. As a result, a variance used as a cut criterion for hit inclusion on a track was calculated on different numbers of hits, depending on the order.
In FDC pseudo-hit creation the "first hit" on a wire was chosen for inclusion. That choice was manifestly hit-order dependent.
There was a bug in the FDC cathode clustering code that caused multiple hits on a single cathode to be split across different clusters. That bug manifested itself in a way that depended on hit order.
For some events, a perfectly fine start counter hit was ignored in determining the drift start time for a track. This happened because the reference trajectory in the fringe field area was calculated using a field value that depended on the recycling history of the reference trajectory object. In certain cases a bad, left-over field value would give a trajectory where the intersection of the track with the projection of the start counter plane in the r-phi view was unacceptably far downstream of the physical start counter in z.

These have all been fixed on a branch. David will look at moving the changes onto the master branch.
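The first issue, a cut that depends on the order in which hits arrive, can be illustrated with a toy example. This is not the GlueX tracking code, and the minutes do not say how the fix was implemented; the functions, the seed size, and the cut width below are assumptions chosen only to show the effect. Sorting the input by a deterministic key is one common way to remove such a dependence.

```python
from statistics import pvariance


def running_variance_selection(values, n_seed=3, n_sigma=2.0):
    """Toy hit selection that depends on input order.

    The first n_seed values are always accepted. Each later value is accepted
    only if it lies within n_sigma standard deviations of the mean of the
    values accepted so far.
    """
    accepted = list(values[:n_seed])
    for v in values[n_seed:]:
        mean = sum(accepted) / len(accepted)
        sigma = pvariance(accepted) ** 0.5
        if abs(v - mean) <= n_sigma * sigma:
            accepted.append(v)
    return accepted


def order_independent_selection(values, n_seed=3, n_sigma=2.0):
    """Same selection, but with the input order enforced by sorting first."""
    return running_variance_selection(sorted(values), n_seed, n_sigma)


hits = [1.0, 1.1, 0.9, 1.3, 5.0]
same_hits_reordered = list(reversed(hits))

# The unsorted selection keeps different hits for the two orderings,
# while the sorted variant gives the same answer for both.
differs = sorted(running_variance_selection(hits)) != sorted(
    running_variance_selection(same_hits_reordered)
)
agrees = order_independent_selection(hits) == order_independent_selection(
    same_hits_reordered
)
```

With these inputs `differs` and `agrees` are both true: presenting the same hits in reverse order changes which hits pass the cut unless the order is enforced first.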

HDGeant4 Workflow

At the collaboration meeting Mark talked to Richard about developing a workflow for HDGeant4 that would allow early testing by collaborators as well as controlled contributions. Mark reported that they agreed to a plan in principle last week.

Note that this workflow is not the same as the one we have been using for sim-recon. In particular, it will require contributors to compose pull requests based on a private clone of the HDGeant4 repository on GitHub rather than from a branch of the "Jefferson Lab" repository. Most collaborators will not have the privilege to create such a branch on the JLab repository.