OWG Meeting 11-Jan-2017


Location and Time

Room: CC F326

Time: 2:00pm-3:00pm

Connection

You can connect using BlueJeans Video conferencing (ID: 120 390 084):

(If there are problems, call the phone in the conference room: 757-269-6460)

  1. To join via a Polycom room system, go to the IP address 199.48.152.152 (bjn.vc) and enter the meeting ID: 120390084.
  2. To join via a Web Browser, go to the page https://bluejeans.com/120390084.
  3. To join via phone, use one of the following numbers and the Conference ID: 120390084
    • US or Canada: +1 408 740 7256 or
    • US or Canada: +1 888 240 2560
  4. More information on connecting to bluejeans is available.


Previous Meeting

Agenda

  1. Announcements
  2. DAQ specs for upcoming run (slides)
  3. gluonraid3 issues (slides)
  4. Run Preparations
  5. Future Online Meetings
  6. AOT


Recharge Wednesday: - Ice cream Novelties -

Offsite option

Minutes

Attendees: David L. (chair), Sergey F., Dave A., Carl T., Vardan G., Bryan M., Simon T., Curtis M., Hovanes E., Graham H.

Announcements

  • First run meeting of the Spring 2017 run is tomorrow morning

DAQ specs for upcoming run

  • Current expectation is that we should be able to run the DAQ at 50 kHz during the Spring run
    • This is 2.5 times the original spec, but the expectation is driven by previous experience
    • The first limitation is the rate at which we can write to disk, which is roughly 1 GB/s
  • David showed a preliminary plot indicating that 50 kHz may be achievable at 1 GB/s, but further investigation is needed since this appears somewhat inconsistent with what was extrapolated from Spring 2016 data
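
The figures above imply a rough event-size budget. The following is a minimal sketch of that arithmetic only, assuming "1GB/s" means 10^9 bytes per second, that the disk write rate is the sole limit, and that "2.5 times" is a plain multiplicative factor. The function names are illustrative and do not come from any DAQ software.

```python
def implied_event_size_bytes(disk_rate_bytes_per_s, trigger_rate_hz):
    """Average event size that would exactly saturate the disk at this trigger rate."""
    return disk_rate_bytes_per_s / trigger_rate_hz


def original_spec_hz(expected_rate_hz, factor_over_spec):
    """Original trigger-rate spec, given the expected rate and its ratio to the spec."""
    return expected_rate_hz / factor_over_spec


# Values quoted in the minutes (unit interpretation is an assumption).
DISK_RATE_BYTES_PER_S = 1.0e9   # "roughly 1GB/s", taken as 1e9 bytes/s
EXPECTED_TRIGGER_HZ = 50.0e3    # 50 kHz
FACTOR_OVER_SPEC = 2.5          # "2.5 times" the original spec

if __name__ == "__main__":
    size = implied_event_size_bytes(DISK_RATE_BYTES_PER_S, EXPECTED_TRIGGER_HZ)
    spec = original_spec_hz(EXPECTED_TRIGGER_HZ, FACTOR_OVER_SPEC)
    print(f"Implied average event size: {size / 1e3:.1f} kB")
    print(f"Implied original spec: {spec / 1e3:.1f} kHz")
```

Under these assumptions, 50 kHz at 1 GB/s corresponds to an average event size of about 20 kB, and the original spec would have been 20 kHz. Actual event sizes depend on running conditions, which is part of why further investigation is needed.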

gluonraid3 issues

  • Corrupted data observed on gluonraid3, partition 3
  • Fall run used partitions 3 and 4 for beam data
    • Each partition used 10 disks configured using software RAID0 and formatted with XFS
  • Most files currently on gluonraid3, partition 3 have different md5 checksum values compared to what is on tape
    • Looking closer at one file revealed the file on tape started with a valid EVIO block header while the copy on gluonraid3 was corrupted.
    • The timestamp of the file on gluonraid3 was close to the timestamp of the mss stub file (from Dec. 16), suggesting the file has not been modified.
  • Hovanes pointed out that the problem could be with the controller and not a disk
  • Current plan is to invest in 2 hardware RAID controller cards to replace the JBOD controllers
    • Will try to get them delivered and installed by end of next week
  • Plan B: Use gluonraid1 and gluonraid2 at 800MB/s
  • Plan C: Try higher level of software RAID
  • Plan D: Get ZFS configuration optimized
  • Sergey noted that even if we do have RAID level redundancy for data protection, if a disk goes out the system performance will be degraded while it repairs itself, likely making it unusable for high rate I/O
  • Dave A. asked about spare disks for gluonraid3
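
The two checks described above (comparing md5 checksums against the tape copy, and looking for a valid EVIO block header at the start of a file) could be scripted along the following lines. This is a hedged sketch, not the procedure actually used: the function names are hypothetical, the checksum of the tape copy is assumed to be supplied by the caller, and the header check assumes an 8-word EVIO block header whose last word is the magic number 0xc0da0100.

```python
import hashlib
import struct

# Assumptions about the EVIO block header layout (not taken from the minutes).
EVIO_MAGIC = 0xC0DA0100
EVIO_HEADER_WORDS = 8  # 32-bit words; the magic number is assumed to be the last


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_tape_checksum(path, tape_md5_hex):
    """Compare the on-disk md5 with the checksum recorded for the tape copy."""
    return md5_of_file(path) == tape_md5_hex.strip().lower()


def starts_with_evio_block_header(path):
    """Check whether the first 8 words end with the EVIO magic number.

    Both byte orders are tried, since the file endianness is not assumed here.
    """
    nbytes = 4 * EVIO_HEADER_WORDS
    with open(path, "rb") as f:
        header = f.read(nbytes)
    if len(header) < nbytes:
        return False
    big = struct.unpack(">8I", header)[-1]
    little = struct.unpack("<8I", header)[-1]
    return EVIO_MAGIC in (big, little)
```

A file that fails `matches_tape_checksum` but passes `starts_with_evio_block_header` would suggest corruption later in the file rather than at the start. The magic-number test only inspects the first block header and does not validate the rest of the file.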

Run Preparations

  • FCAL10 issue has been resolved (bad f250 module)
  • CODA 3.0.7 may be available by end of next week. It could provide better performance due to:
    • More efficient object allocation leading to less garbage collection
    • Multi-threaded event building

Future Online Meetings

  • Next meeting is in 2 weeks, but we will likely be having regular run meetings at that time and will start deferring the Online meetings until after the run.
  • David noted that the original intent of the Online meetings was to provide tight coordination between Controls, Trigger, DAQ, and all other online systems (monitoring, copy to tape, gluon computer admin, ...). At this point, though, these topics have naturally broken off into separate meetings. The exception is the DAQ, which is partially covered in the L1 meetings and partially in the Online meetings. The Online meetings are usually not attended by Sascha, who is the person primarily responsible for a large part of the DAQ. Thus, we should consider whether the current format of this meeting is the most efficient.