NU DC2 Tests
Revision as of 20:03, 12 February 2014

  • I've been running jobs simulating 10K events each, using the same package versions that Mark described in the meeting on Friday.
  • The machines have 0.75-1.5 GB of memory per core.
  • There are no resource limits in place.
  • I've gotten to a success rate of >50% (the exact number is uncertain since I was staging some of the intermediate files on disks local to the nodes, which would sometimes fill up).
  • Nearly all failures happened at the REST stage, and were usually due to a thread taking too long and being killed. I've increased the thread timeout to 90s (see the sketch after this list), and this seems to help.
  • The REST processes do get up to 1.5-2 GB in size.
  • The failed jobs seem consistent with either hitting events that take a very long time to reconstruct or being resource-starved. I'm going to see what I can find out about the events on which the jobs die.
  • I'm also running jobs simulating 50K events to more closely reproduce Mark's results.
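
A minimal sketch of how the longer per-thread timeout could be passed to the REST-producing reconstruction step. It assumes an hd_root-style invocation, that THREAD_TIMEOUT is the relevant JANA configuration parameter, and that the REST output comes from the danarest plugin; adjust to the actual job script.

  # Sketch only: the parameter names and the danarest plugin are assumptions
  # about the actual job configuration.
  import subprocess

  def run_rest_step(hddm_file, thread_timeout_s=90):
      """Run the reconstruction step with a longer per-thread timeout."""
      cmd = [
          "hd_root",
          f"-PTHREAD_TIMEOUT={thread_timeout_s}",  # only kill a thread after this many seconds
          "-PPLUGINS=danarest",                    # assumed name of the REST-writing plugin
          hddm_file,
      ]
      return subprocess.run(cmd, check=False).returncode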

--Sdobbs 00:34, 10 February 2014 (EST)


  • Ran jobs overnight with 50K events each.
  • Killed ~1/3 of them after 13+ hours (see the sketch after this list).
  • Success rate >40%, but this time with many failures at the mcsmear stage.
  • Jobs on the newer nodes were running fine, but the older nodes were clearly resource-constrained (likely memory).
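
A sketch of the kind of wall-clock reaping used above: on each node, find the job processes and kill any that have run longer than the budget. The process names and the 13-hour cutoff are assumptions; adjust to the actual job setup.

  # Sketch only: process names and the wall-clock cutoff are assumptions.
  import time
  import psutil

  JOB_NAMES = {"hdgeant", "mcsmear", "hd_root"}
  MAX_WALL_SECONDS = 13 * 3600  # kill anything running longer than 13 hours

  def reap_stuck_jobs():
      now = time.time()
      for proc in psutil.process_iter(["name", "create_time"]):
          if proc.info["name"] in JOB_NAMES and now - proc.info["create_time"] > MAX_WALL_SECONDS:
              print(f"killing {proc.info['name']} (pid {proc.pid})")
              proc.kill()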

--Sdobbs 14:36, 10 February 2014 (EST)


  • Ran jobs with 10K events each.
  • Success rate of 161/250 = 64%.
  • 18 jobs hung at the beginning of REST generation and are still in the queue.
  • ~10 jobs seemed to succeed but produced truncated REST files, all from one node. That node also had problems downloading the magnetic field, which seems to point to a disk-space problem on the node.
  • Most of the other failures occurred at the beginning of REST file generation without any useful diagnostic output.
  • A few other miscellaneous failures.
  • Going to rerun the same configuration under gdb to try to get more useful information (see the sketch after this list).
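
One way to get a backtrace out of batch jobs that die without useful output is to wrap the reconstruction step in gdb's batch mode; a sketch, where the hd_root-style command, plugin, and file names are placeholders.

  # Sketch only: wraps an arbitrary command in gdb batch mode so that a
  # backtrace lands in the job log if the process crashes.
  import subprocess

  def run_under_gdb(cmd):
      """Run `cmd` (program + arguments) under gdb and dump a backtrace when it stops."""
      gdb_cmd = [
          "gdb", "--batch",
          "-ex", "run",   # start the program immediately
          "-ex", "bt",    # print a backtrace when it stops (e.g. on a crash)
          "--args", *cmd,
      ]
      return subprocess.run(gdb_cmd, check=False).returncode

  # e.g. run_under_gdb(["hd_root", "-PPLUGINS=danarest", "hdgeant_smeared.hddm"])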

--Sdobbs 15:15, 11 February 2014 (EST)


  • Two nodes hung in the previous run (the dreaded automounter hangs). Restarting 250 jobs with 10K events each.

--Sdobbs 13:01, 12 February 2014 (EST)


  • 17 jobs crashed due to problems downloading the magnetic field files: the curl download loses its connection to the squid proxy (see the retry sketch after this list). Will try running with CVMFS once the configuration changes actually propagate.
  • No other crashes noted; ~3 REST files didn't copy over cleanly for some reason (NFS problems?).
  • There could also be some throttling from dumping the MC tree for each event or from running under gdb.
  • Killing ~30 jobs that are taking more than 3 hours to complete; this may be due to the EM background.
  • Need to investigate why squid is not caching correctly.
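
A sketch of a more forgiving download for the field map: retry through the squid proxy a few times and fall back to a direct connection if the proxy keeps dropping. The URL and proxy address below are placeholders, not the real resource locations.

  # Sketch only: URL and proxy address are placeholders.
  import subprocess

  def fetch_resource(url, dest, proxy="http://squid.example.edu:3128", retries=3):
      """Download `url` to `dest`, first via the proxy, then directly."""
      for attempt_proxy in (proxy, None):
          cmd = ["curl", "--fail", "--retry", str(retries), "-o", dest, url]
          if attempt_proxy:
              cmd += ["--proxy", attempt_proxy]
          if subprocess.run(cmd).returncode == 0:
              return True
      return False

  # e.g. fetch_resource("https://halldweb.example.org/resources/solenoid_map", "resources/solenoid_map")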

--Sdobbs 16:51, 12 February 2014 (EST)


  • Running 250 jobs of 1K events each.
  • Caching resources (the magnetic field map) on disk, not just in CCDB.
  • Most REST jobs crashed; I was running them through gdb, and most crashes seem to be in the REST output functions.
  • Ran the jobs again staging just the REST output on local disk, since some nodes don't have enough local disk to cache everything (see the sketch after this list), and got a success rate > 95% (though I did stop the jobs early).
  • Running 50K-event jobs overnight.
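
A sketch of the local-disk staging decision for the REST output: use the node-local scratch area only if it has enough free space, otherwise write straight to the shared work directory and skip the copy-back. The paths and the free-space threshold are assumptions.

  # Sketch only: paths and the free-space threshold are assumptions.
  import os
  import shutil

  LOCAL_SCRATCH = "/scratch"         # node-local disk
  SHARED_WORKDIR = os.getcwd()       # shared (NFS) job directory
  MIN_FREE_BYTES = 5 * 1024**3       # require ~5 GB free before using local disk

  def rest_output_dir():
      """Pick where this job should write its REST file."""
      try:
          free = shutil.disk_usage(LOCAL_SCRATCH).free
      except OSError:
          free = 0
      return LOCAL_SCRATCH if free > MIN_FREE_BYTES else SHARED_WORKDIR

  def stage_back(rest_file):
      """Copy the REST file to the shared area if it was written locally."""
      if os.path.dirname(rest_file) != SHARED_WORKDIR:
          shutil.copy2(rest_file, SHARED_WORKDIR)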

--Sdobbs 19:03, 12 February 2014 (EST)