CMU Data Challenge 2
Revision as of 10:58, 28 March 2014

  1. At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are written to a local RAID disk. Jobs are managed by PBS (Torque and Maui).
  2. All 384 cores are reserved for the data challenge for three weeks.
  3. Start-up Problems
    • All jobs were initially reading from the same shared copies of the sqlite database, the resources directory, and the hdds geometry, instead of each having its own local copies; a staging sketch is given after this list.
    • Large-cluster configuration problems slowed our start. These were resolved by tuning the PBS parameters that control how often pbs_mom reports to the head node.
    • We are still battling a scheduler issue, but a work-around has been found.
    • Jobs have been running smoothly since ~Tuesday.
  4. As of 10:30am, 1087 jobs have completed.
    • 9001 Series - 859 jobs with EM background at 1E7 : 21.47 MEvents : 1 failure (DMagneticFieldMapFineMesh::GetFieldAndGradient())
    • 9002 Series - 225 jobs with EM background at 5E7 : 2.25 MEvents : 0 failures
    • 9003 Series - 90 jobs without EM background : 4.45 MEvents : 1 failure (job lost to the aether)
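
The staging work-around noted under the start-up problems above lends itself to a short illustration. The Python sketch below writes one PBS batch script per job and submits it with qsub; each batch script copies its own private sqlite, resources, and hdds files to node-local scratch before running. This is a minimal sketch only: the paths, the file names (ccdb.sqlite, hdds.xml), and the run_dc2_job command are hypothetical placeholders, not the actual CMU production chain.

 #!/usr/bin/env python
 # Sketch of per-job staging and PBS submission. All paths, file
 # names, and the payload command are hypothetical placeholders.
 import os
 import subprocess
 import tempfile
 import textwrap
 
 SHARED = "/raid/dc2"   # hypothetical shared area on the RAID disk
 N_JOBS = 4             # submit a handful of jobs as a demonstration
 
 PBS_TEMPLATE = textwrap.dedent("""\
     #!/bin/bash
     #PBS -N dc2_{job:05d}
     #PBS -l nodes=1:ppn=1
     #PBS -j oe
 
     # Stage private copies on node-local scratch so jobs do not
     # all read the same shared sqlite, resources, and hdds files.
     SCRATCH=/scratch/$PBS_JOBID
     mkdir -p $SCRATCH && cd $SCRATCH
     cp {shared}/ccdb.sqlite .
     cp -r {shared}/resources .
     cp {shared}/hdds.xml .
 
     # Placeholder payload; the real chain runs the simulation,
     # smearing, and reconstruction against the local copies.
     run_dc2_job --calib ./ccdb.sqlite --resources ./resources \\
         --geometry ./hdds.xml --job {job}
 
     # Return output to the shared disk and clean up scratch.
     cp *.root {shared}/output/
     rm -rf $SCRATCH
 """)
 
 def submit(job):
     """Write a PBS script for one job and hand it to qsub."""
     script = PBS_TEMPLATE.format(job=job, shared=SHARED)
     with tempfile.NamedTemporaryFile("w", suffix=".pbs",
                                      delete=False) as f:
         f.write(script)
         path = f.name
     subprocess.check_call(["qsub", path])  # qsub spools its own copy
     os.remove(path)
 
 if __name__ == "__main__":
     for job in range(N_JOBS):
         submit(job)

Staging to node-local disk trades a small per-job copy cost for the removal of contention on the shared files, which is the failure mode described in the first start-up problem.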