NU DC2 Tests
Revision as of 20:03, 12 February 2014

  • I've been running jobs simulating 10K events each, using the same package versions that Mark described in the meeting on Friday.
  • The machines have 0.75-1.5 GB of memory per core.
  • There are no resource limits in place.
  • I've gotten to a success rate of >50% (the exact number is uncertain since I was staging some of the intermediate files on disks local to the nodes, which would sometimes fill up).
  • Nearly all failures happened at the REST stage, and were usually due to a thread taking too long and being killed. I've increased the thread timeout to 90s (see the sketch after this list), and this seems to help.
  • The REST processes do get up to 1.5-2 GB in size.
  • The failed jobs seem consistent with either hitting events that take a very long time to reconstruct or being resource-starved. I'm going to see what I can find out about the events on which the jobs die.
  • I'm also running jobs simulating 50K events to more closely reproduce Mark's results.
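
A minimal sketch of how the longer per-thread timeout could be passed to the REST-producing reconstruction step. It assumes an hd_root-style invocation, that THREAD_TIMEOUT is the relevant JANA configuration parameter, and that the REST output comes from the danarest plugin; adjust to the actual job script.

  # Sketch only: the parameter names and the danarest plugin are assumptions
  # about the actual job configuration.
  import subprocess

  def run_rest_step(hddm_file, thread_timeout_s=90):
      """Run the reconstruction step with a longer per-thread timeout."""
      cmd = [
          "hd_root",
          f"-PTHREAD_TIMEOUT={thread_timeout_s}",  # only kill a thread after this many seconds
          "-PPLUGINS=danarest",                    # assumed name of the REST-writing plugin
          hddm_file,
      ]
      return subprocess.run(cmd, check=False).returncode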

--Sdobbs 00:34, 10 February 2014 (EST)


  • Ran jobs overnight with 50K events each.
  • Killed ~1/3 of them after 13+ hours (see the sketch after this list).
  • Success rate >40%, but this time with many failures at the mcsmear stage.
  • Jobs on the newer nodes were running fine, but the older nodes were clearly resource-constrained (likely memory).
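
A sketch of the kind of wall-clock reaping used above: on each node, find the job processes and kill any that have run longer than the budget. The process names and the 13-hour cutoff are assumptions; adjust to the actual job setup.

  # Sketch only: process names and the wall-clock cutoff are assumptions.
  import time
  import psutil

  JOB_NAMES = {"hdgeant", "mcsmear", "hd_root"}
  MAX_WALL_SECONDS = 13 * 3600  # kill anything running longer than 13 hours

  def reap_stuck_jobs():
      now = time.time()
      for proc in psutil.process_iter(["name", "create_time"]):
          if proc.info["name"] in JOB_NAMES and now - proc.info["create_time"] > MAX_WALL_SECONDS:
              print(f"killing {proc.info['name']} (pid {proc.pid})")
              proc.kill()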

--Sdobbs 14:36, 10 February 2014 (EST)


  • Ran jobs with 10K events each.
  • Success rate of 161/250 = 64%.
  • 18 jobs hung at the beginning of REST generation and are still in the queue.
  • ~10 jobs seemed to succeed but produced truncated REST files, all from one node. That node also had problems downloading the magnetic field, which seems to point to a disk-space problem on the node.
  • Most of the other failures occurred at the beginning of REST file generation without any useful diagnostic output.
  • A few other miscellaneous failures.
  • Going to rerun the same configuration under gdb to try to get more useful information (see the sketch after this list).
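
One way to get a backtrace out of batch jobs that die without useful output is to wrap the reconstruction step in gdb's batch mode; a sketch, where the hd_root-style command, plugin, and file names are placeholders.

  # Sketch only: wraps an arbitrary command in gdb batch mode so that a
  # backtrace lands in the job log if the process crashes.
  import subprocess

  def run_under_gdb(cmd):
      """Run `cmd` (program + arguments) under gdb and dump a backtrace when it stops."""
      gdb_cmd = [
          "gdb", "--batch",
          "-ex", "run",   # start the program immediately
          "-ex", "bt",    # print a backtrace when it stops (e.g. on a crash)
          "--args", *cmd,
      ]
      return subprocess.run(gdb_cmd, check=False).returncode

  # e.g. run_under_gdb(["hd_root", "-PPLUGINS=danarest", "hdgeant_smeared.hddm"])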

--Sdobbs 15:15, 11 February 2014 (EST)


  • Two nodes hung in the previous run (the dreaded automounter hangs). Restarting 250 jobs with 10K events each.

--Sdobbs 13:01, 12 February 2014 (EST)


  • 17 jobs crashed due to problems downloading the magnetic field files: the curl download loses its connection to the squid proxy (see the retry sketch after this list). Will try running with CVMFS once the configuration changes actually propagate.
  • No other crashes noted; ~3 REST files didn't copy over cleanly for some reason (NFS problems?).
  • There could also be some throttling from dumping the MC tree for each event or from running under gdb.
  • Killing ~30 jobs that are taking more than 3 hours to complete; this may be due to the EM background.
  • Need to investigate why squid is not caching correctly.
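
A sketch of a more forgiving download for the field map: retry through the squid proxy a few times and fall back to a direct connection if the proxy keeps dropping. The URL and proxy address below are placeholders, not the real resource locations.

  # Sketch only: URL and proxy address are placeholders.
  import subprocess

  def fetch_resource(url, dest, proxy="http://squid.example.edu:3128", retries=3):
      """Download `url` to `dest`, first via the proxy, then directly."""
      for attempt_proxy in (proxy, None):
          cmd = ["curl", "--fail", "--retry", str(retries), "-o", dest, url]
          if attempt_proxy:
              cmd += ["--proxy", attempt_proxy]
          if subprocess.run(cmd).returncode == 0:
              return True
      return False

  # e.g. fetch_resource("https://halldweb.example.org/resources/solenoid_map", "resources/solenoid_map")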

--Sdobbs 16:51, 12 February 2014 (EST)


  • Running 250 jobs of 1K events each.
  • Caching resources (the magnetic field map) on disk, not just in CCDB.
  • Most REST jobs crashed; I was running them through gdb, and most crashes seem to be in the REST output functions.
  • Ran the jobs again staging just the REST output on local disk, since some nodes don't have enough local disk to cache everything (see the sketch after this list), and got a success rate > 95% (though I did stop the jobs early).
  • Running 50K-event jobs overnight.
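
A sketch of the local-disk staging decision for the REST output: use the node-local scratch area only if it has enough free space, otherwise write straight to the shared work directory and skip the copy-back. The paths and the free-space threshold are assumptions.

  # Sketch only: paths and the free-space threshold are assumptions.
  import os
  import shutil

  LOCAL_SCRATCH = "/scratch"         # node-local disk
  SHARED_WORKDIR = os.getcwd()       # shared (NFS) job directory
  MIN_FREE_BYTES = 5 * 1024**3       # require ~5 GB free before using local disk

  def rest_output_dir():
      """Pick where this job should write its REST file."""
      try:
          free = shutil.disk_usage(LOCAL_SCRATCH).free
      except OSError:
          free = 0
      return LOCAL_SCRATCH if free > MIN_FREE_BYTES else SHARED_WORKDIR

  def stage_back(rest_file):
      """Copy the REST file to the shared area if it was written locally."""
      if os.path.dirname(rest_file) != SHARED_WORKDIR:
          shutil.copy2(rest_file, SHARED_WORKDIR)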

--Sdobbs 19:03, 12 February 2014 (EST)