CMU Data Challenge 2

  1. At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (Torque and Maui).
  2. All 384 cores are reserved for the data challenge for three weeks.
  3. Did not switch to optional version.
  4. Start-up Problems
    • All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
    • Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
    • Still battling a scheduler issue, but a workaround has been found.
    • Running smoothly since ~Tuesday.
  5. Final Tally: 7000 jobs, 3 failures:
    • 9001 Series - 5600 jobs at 1E7 with EM background (25k events each): 139.87 MEvents, 1 failure (small REST file):
      • 09001_0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient()
    • 9002 Series - 875 jobs at 5E7 with EM background (10k events each): 8.75 MEvents, 0 failures
    • 9003 Series - 525 jobs without EM background (50k events each): 26.15 MEvents, 2 failures:
      • 09003_0000014: lost to the aether (no record of it; likely a PBS failure)
      • 09003_0000392: timed out ~9-10k events into hdgeant (96 hrs)
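
The tally arithmetic can be cross-checked with a short script. This is only a sketch: the job counts, events per job, and failure counts are copied from the list above, while the names `SERIES`, `nominal_mevents`, and `summarize` are illustrative, and the calculation assumes that every non-failed job produced its full event quota, which this page does not state.

```python
# Cross-check of the final tally.
# Each entry: (series, jobs submitted, events per job, failed jobs),
# copied from the list above. All names here are illustrative.
SERIES = [
    ("9001", 5600, 25_000, 1),
    ("9002", 875, 10_000, 0),
    ("9003", 525, 50_000, 2),
]


def nominal_mevents(jobs, events_per_job, failures):
    """Events (in millions) if every non-failed job wrote its full quota.

    This is an assumption: a job may finish with fewer events than requested.
    """
    return (jobs - failures) * events_per_job / 1e6


def summarize(series):
    """Return total jobs, total failures, and nominal MEvents per series."""
    total_jobs = sum(jobs for _, jobs, _, _ in series)
    total_failures = sum(failed for _, _, _, failed in series)
    per_series = {
        name: nominal_mevents(jobs, events, failed)
        for name, jobs, events, failed in series
    }
    return total_jobs, total_failures, per_series


if __name__ == "__main__":
    total_jobs, total_failures, per_series = summarize(SERIES)
    print(f"jobs: {total_jobs}, failures: {total_failures}")
    for name, mevents in per_series.items():
        print(f"{name} series: {mevents:.3f} MEvents (nominal)")
```

The job and failure totals come out to 7000 and 3, and the nominal 9002 and 9003 values (8.75 and 26.15 MEvents) agree with the figures listed. For the 9001 series the nominal value is 139.975 MEvents against the 139.87 listed; the page does not say how the listed figure was counted.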