GlueX Offline Software Meeting
Wednesday, January 21, 2015
1:30 pm EST
JLab: CEBAF Center F326/327

Agenda

Announcements
1. Volatile disk expanded: reservation 10 -> 20 TB, quota 30 -> 50 TB
2. Marty Wise working on Run Conditions (Control?) Database (RCDB)
3. Computer Center has RHEL7 available for beta testers
4. Work disk full
Review of minutes from January 7 (all)
Data Challenge 3
Software Review Preparations
Commissioning Run Review:
1. Offline Monitoring Report (Kei)
  1. Ran over all files (online plugins, 2-track EVIO skim, REST) 2 weeks ago
  2. Next launch is this Friday
  3. Will be testing EventStore to mark events
  4. Quick update on CentOS65, multithread processing
2. Commissioning-branch-to-trunk migration (Simon)
3. Handling changing magnetic field settings (Sean)
4. Analysis of REST file data (Justin)
5. Data Management (Sean)
  1. Storing software information in REST files
  2. EVIO format definition for Level 3 trigger farm
  3. EventStore: implementation plan
Requests to SciComp on farm features (Kei)
HDDM versions and backward compatibility
Action Item Review

Communication Information

Remote Connection

The BlueJeans meeting number is 968 592 007.
Join the Meeting via BlueJeans

Slides

Talks can be deposited in the directory /group/halld/www/halldweb/html/talks/2015 on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2015/ .

Minutes

Present:

CMU: Curtis Meyer
FIU: Mahmoud Kamel
FSU: Aristeidis Tsaris
JLab: Alex Barnes, Mark Ito (chair), David Lawrence, Paul Mattione, Kei Moriya, Eric Pooser, Simon Taylor, Beni Zihlmann
NU: Sean Dobbs

Announcements

Our volatile disk was expanded recently. The reservation increased from 10 to 20 TB, and the quota from 30 to 50 TB. We are using just over 20 TB presently.
Marty Wise of Computing and Network Infrastructure (CNI) is working on installing the Run Conditions Database (RCDB) on an Apache server.
CNI now has a desktop version of RedHat Enterprise Linux 7 available for beta testers. See Kelvin Edwards for an install image.
Our work disk filled up this morning. We have 14 TB at present. Volunteers deleting their files have got it down to 75% used now.
Mark remarked that we should review our long-term requests for disk space and see if we can start to expand our disk portfolio in a significant way.

Review of Minutes from January 7

We went over the minutes. Items were either resolved or appear on the agenda for this meeting.

Data Challenge 3

Mark has successfully run test jobs going all the way from event generation to REST file production. Along the way EVIO-formatted data is produced and read. He showed some statistics about the test jobs presented at the last Software Review Preparation Meeting. The next step is to scale the jobs to real-challenge size. We hope to be in production by the time of the Software Review.

Software Review Preparations

Curtis reviewed the discussion we had at last Friday's Software Review Preparations Meeting. We spent some time answering questions from Graham about our needs vis-a-vis the schedule for computer procurements. We also ran down a list of talking points and topics that we plan to present.

Commissioning Run Review

Offline Monitoring Report

Kei gave the report.

He ran over all files (online plugins, 2-track EVIO skim, REST) 2 weeks ago.
Next launch of the entire process is this Friday.
The group will be testing EventStore to mark events. This will take some dedicated disk space.
Kei showed slides, giving an update on CentOS65 use and multi-thread processing.

Commissioning-Branch-to-Trunk Migration

Simon reported that he and Mark have started migrating code developed on the commissioning branch during the run to the trunk of the source code repository. An initial attempt gave a version that compiled and ran, but when the b1pi test was run with the code, no successful kinematic fits were produced.

Paul asked if the Monte Carlo variation was being used; it was not. This will be tried next.

Analysis of REST File Data

Justin reported that he has had success reproducing his recent bump-hunting plots starting from REST-formatted data. This mode would allow users to pursue similar studies without having to fetch the data from tape and perform reconstruction, a big time saving. He did this with a private version of the code. There is currently an issue with unpacking tagger hits from the REST file. Hopefully this can be fixed before the next generation of REST files is produced.

Handling Changing Magnetic Field Settings

Quoting from a recent email from Sean:

One of the bigger headaches of running over the fall data was keeping
track of all of the different magnetic field conditions, as the field
went up and down.  It would be user-friendly if we could keep track
of this information as well, instead of forcing the user to specify
the correct magnetic field map on the command line every time.
Naively, I'd think that we could add a CCDB table that stored the
name of the magnetic field map to use, i.e., that same information
that would be passed in on the command line.  Maybe this information
is better stored as geometry or something else, though?

Mark remarked that we did have a plan for handling this problem using the CCDB and JANA Resources. David, Sean, and Mark will get together offline to revisit the plan.
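
The lookup Sean describes, from run number to the name of the field map to use, could look roughly like the sketch below. This is only an illustration of the idea, not the CCDB or JANA Resources plan mentioned above: the run boundaries, the map names, and the function field_map_for_run are all hypothetical, and a plain Python list stands in for whatever table would actually hold the information.

```python
from bisect import bisect_right

# Hypothetical table: (first run of a range, field map name).
# Each entry applies from its first run up to the next entry's first run.
# Run numbers and map names are invented for illustration only.
FIELD_MAP_BY_FIRST_RUN = [
    (1, "field_map_A"),
    (1000, "field_map_B"),
    (2000, "field_map_A"),
]


def field_map_for_run(run, table=FIELD_MAP_BY_FIRST_RUN):
    """Return the field map name that applies to the given run number."""
    first_runs = [first for first, _ in table]
    # Index of the last range whose first run is <= the requested run.
    idx = bisect_right(first_runs, run) - 1
    if idx < 0:
        raise LookupError(f"no field map defined for run {run}")
    return table[idx][1]
```

With a table like this the user would supply only the run number, and the map name would no longer need to be given on the command line for every job.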

Data Management

Quoting from the same email from Sean, three items:

Storing software information in REST files

Since we're storing information on the software conditions used for
reconstruction, it might be nice to store some of this information in
the "officially" created REST files themselves, for a certain amount
of self-documentation.

Mark thought that it should be possible to add a "software version" element to the REST format, independent of the physics events, at the beginning of the file. Paul will ask Richard Jones about how this might be done.

EVIO format definition for Level 3 trigger farm

Is running the L3 trigger farm a goal of the spring running?  If so,
it would be useful to define the EVIO output format that would be
used.  I seem to remember that even if we run in pass-through mode,
the L3 farm could be used to disentangle multi-block EVIO events, and
output them in single-block format.

David remarked that disentangling was fundamental to the L3 design and that any output from L3 would be in single-block form.

EventStore: implementation plan

One thing that could save the amount of disk space needed for
handling skims would be the EventStore DB, the development of which
I've taken back up.  However, the user would still need access to
these files, so it would only help for people running over the data
at JLab.  So in the end, there might be still be a desire for us to
make these files, for those who want to analyze the files at their
home institutions.

The exact model we will use has not been decided. Mark thought that to first order we would try to distribute at least the REST-formatted data to each institution, so that each site could have a functional EventStore-based system. This should be doable for early running at least.

Requests to SciComp on Farm Features

Kei led us through a set of questions and feature requests he sent to SciComp. These were collected from the group working on offline monitoring.

Tools to track jobs:
1. tools to track what percentage of nodes were being used by whom at a given time, preferably in both # of jobs and threads. We can see the pie charts for example in http://scicomp.jlab.org/scicomp/#/auger/usage but would like the information in a form that we can easily access and analyze.
2. what % of nodes are currently available for each OS at a given time
3. tools to track the life time of each stage of the job, such as sitting in queue, waiting for files from tape, running, etc.
4. Would it be possible to make the stdout and stderr web-viewable?
5. If possible, can you add the ability to search by “job name” (every job that includes the search term) in the auger custom job query website?
For more general requests:
1. better transparency for whether there are problems in the system, such as heavy traffic due to users, broken disks, etc. Could there be an email list/webpage for that information?
2. clarification of how 'priority' of jobs works between different halls and users.
3. would it be possible for the system to auto-resubmit failed jobs if the failure is on the side of the system (e.g., bad farm nodes, temporary loss of connection)?
Additionally, ask for more space on cache disk?
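
As an illustration of the job-lifetime request above (time spent in queue, waiting for files from tape, running, and so on), the sketch below computes per-stage durations from a list of stage transitions. It assumes timestamps could be obtained in ISO 8601 form; the stage names, the record layout, and the function stage_durations are hypothetical and do not reflect any existing SciComp or Auger output.

```python
from datetime import datetime


def stage_durations(transitions):
    """Return seconds spent in each stage of a job.

    transitions is a time-ordered list of (stage_name, iso_timestamp)
    pairs; the last pair only marks the end of the previous stage.
    """
    durations = {}
    for (stage, start), (_, end) in zip(transitions, transitions[1:]):
        delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        # Accumulate, in case a job re-enters a stage (e.g. is requeued).
        durations[stage] = durations.get(stage, 0.0) + delta.total_seconds()
    return durations


# Hypothetical job history, invented for illustration only.
example_job = [
    ("queued", "2015-01-21T08:00:00"),
    ("waiting_for_tape", "2015-01-21T08:30:00"),
    ("running", "2015-01-21T09:15:00"),
    ("done", "2015-01-21T11:15:00"),
]
example_durations = stage_durations(example_job)
```

Summed over many jobs, numbers of this kind would show whether time is lost mainly in the queue, in tape staging, or in running.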

There is a meeting tomorrow with SciComp personnel to go over the list. Interested parties should attend.

Action Items

Ask Richard about a new software information element in the REST format. (Paul)
Meet to figure out magnetic field map handling using CCDB and Resources. (David, Sean, Mark)
