Difference between revisions of "CMU Data Challenge 2"

Line 7:
  #* Still battling a scheduler issue. Work-around has been found.
  #* Running smoothly since ~Tuesday.
- # As of 10:30am, 1174 jobs have completed.
+ # Final Tally: 7000 jobs:
- #* 9001 Series - 859 1E7 with EM Background (25k Events Each) : 21.47 MEvents : 1 failure (DMagneticFieldMapFineMesh::GetFieldAndGradient())
+ #* 9001 Series - 5600 1E7 with EM Background (25k Events Each) : 139.87 MEvents : 1 failure (0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient())
- #* 9002 Series - 225 5E7 with EM Background (10k Events Each) : 2.25 MEvents : 0 failures
+ #* 9002 Series - 875 5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
- #* 9003 Series - 90 without EM Background (50k Events Each) : 4.45 MEvents : 1 failure (Job lost to the aether)
+ #* 9003 Series - 525 without EM Background (50k Events Each) : 26.15 MEvents : 2 failures (1 Job lost to the aether (pbs fail), 0000392: timed-out ~9-10k events into hdgeant (96 hrs))

Revision as of 11:50, 14 April 2014

  1. At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (TORQUE and Maui).
  2. All 384 cores are reserved for the data challenge for three weeks.
  3. Did not switch to optional version.
  4. Start-up Problems
    • All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
    • Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
    • Still battling a scheduler issue. Work-around has been found.
    • Running smoothly since ~Tuesday.
  5. Final Tally: 7000 jobs:
    • 9001 Series - 5600 1E7 with EM Background (25k Events Each) : 139.87 MEvents : 1 failure (0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient())
    • 9002 Series - 875 5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
    • 9003 Series - 525 without EM Background (50k Events Each) : 26.15 MEvents : 2 failures (1 Job lost to the aether (pbs fail), 0000392: timed-out ~9-10k events into hdgeant (96 hrs))
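
The figures in the list above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are transcribed from this page, while the names `SERIES`, `nominal_mevents`, and `summarize` are made up for the example and are not part of any GlueX tool. It computes the core count, the job total, and the nominal event count each series would reach if every job had delivered its full quota.

```python
# Cross-check of the CMU tally. All numbers are copied from the list above;
# the helper names are hypothetical and exist only for this sketch.

BOXES = 12
CPUS_PER_BOX = 4
CORES_PER_CPU = 8

# (series, jobs, events per job, reported MEvents, reported failures)
SERIES = [
    ("9001", 5600, 25_000, 139.87, 1),
    ("9002", 875, 10_000, 8.75, 0),
    ("9003", 525, 50_000, 26.15, 2),
]


def nominal_mevents(jobs, events_per_job):
    """Millions of events if every job produced its full quota."""
    return jobs * events_per_job / 1e6


def summarize(series):
    """Return one dict per series comparing nominal and reported yields."""
    rows = []
    for name, jobs, per_job, reported, failures in series:
        nominal = nominal_mevents(jobs, per_job)
        rows.append({
            "series": name,
            "jobs": jobs,
            "nominal": nominal,
            "reported": reported,
            # Difference between the full-quota yield and the reported yield.
            "shortfall": round(nominal - reported, 2),
            "failures": failures,
        })
    return rows


if __name__ == "__main__":
    print("total cores:", BOXES * CPUS_PER_BOX * CORES_PER_CPU)
    rows = summarize(SERIES)
    print("total jobs:", sum(r["jobs"] for r in rows))
    for r in rows:
        print(r["series"], r["nominal"], r["reported"], r["shortfall"], r["failures"])
```

The core count and job total reproduce the 384 cores and 7000 jobs quoted above. For the 9003 series the shortfall matches two jobs' worth of events, which is consistent with its two reported failures. The 9001 shortfall is larger than a single 25k-event job, and this page does not say why.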