Difference between revisions of "NU DC2 Tests"

Revision as of 15:45, 11 February 2014

I've been running jobs simulating 10K events using the same package versions as Mark described in the meeting on Friday.
The machines have 0.75-1.5 GB of memory per core.
There are no resource limits.
The success rate so far is >50%. The exact number is uncertain because I was staging some of the intermediate files on disks local to the nodes, and those disks sometimes filled up.
Nearly all failures happened at the REST stage, usually because a thread took too long and was killed. I've increased the thread timeout to 90 s, and this seems to help.
The REST processes do grow to 1.5-2 GB in size.
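
As a rough check on the memory figures above, the following sketch estimates how many cores' worth of memory a single REST process would occupy at its peak size. It uses only the numbers quoted in this entry (0.75-1.5 GB/core, 1.5-2 GB per process); the function name `cores_per_job` is hypothetical, and the calculation assumes memory is shared evenly among cores with no other overhead.

```python
import math


def cores_per_job(mem_per_core_gb: float, process_peak_gb: float) -> int:
    """Cores' worth of memory needed to hold one process at its peak size.

    Assumes memory is split evenly among cores and ignores OS overhead.
    """
    return math.ceil(process_peak_gb / mem_per_core_gb)


if __name__ == "__main__":
    # Figures quoted in the log entry, not new measurements.
    for mem_per_core in (0.75, 1.5):
        for peak in (1.5, 2.0):
            n = cores_per_job(mem_per_core, peak)
            print(f"{mem_per_core} GB/core, {peak} GB peak -> {n} core(s) of memory")
```

Under these assumptions, a node with 0.75 GB/core cannot run one REST job per core without oversubscribing memory, which would be consistent with the jobs being resource starved.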

The failed jobs do seem consistent with either hitting some events that take very long to reconstruct or being resource starved. I'm going to see what I can find out about the events on which the jobs die.
I'm also running jobs simulating 50K events to more closely reproduce Mark's results.

--Sdobbs 00:34, 10 February 2014 (EST)

Ran jobs overnight with 50K events each.
Killed ~1/3 of them after 13+ hours.
Success rate was >40%, but with many failures at the mcsmear stage this time.
Jobs on new nodes were running fine, but older ones were clearly resource-constrained (likely memory).

--Sdobbs 14:36, 10 February 2014 (EST)

@@ Line 24: / Line 24: @@
 * Ran jobs with 10K events each.
 * Success rate of 161/250 = 64%.
+* 18 jobs hung at beginning of REST generation
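
The counts in this entry can be tallied with a short sketch like the one below. It assumes that the 18 hung jobs are counted among the jobs that did not succeed, which the log does not state explicitly; the function name `summarize` is hypothetical.

```python
def summarize(total: int, succeeded: int, hung: int) -> dict:
    """Break a batch of jobs into success rate, hung fraction, and other failures.

    Assumes hung jobs are a subset of the non-successful jobs.
    """
    not_succeeded = total - succeeded
    return {
        "success_rate": succeeded / total,
        "hung_fraction": hung / total,
        "other_failures": not_succeeded - hung,
    }


if __name__ == "__main__":
    # Counts quoted in the log entry: 161 of 250 succeeded, 18 hung.
    s = summarize(total=250, succeeded=161, hung=18)
    print(f"success rate:   {s['success_rate']:.1%}")
    print(f"hung fraction:  {s['hung_fraction']:.1%}")
    print(f"other failures: {s['other_failures']}")
```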