NU DC2 Tests

I've been running jobs simulating 10K events using the same package versions as Mark described in the meeting on Friday.
The machines have 0.75-1.5 GB of memory per core.
There are no resource limits.
I've gotten to a success rate of >50% (the exact number is uncertain since I was staging some of the intermediate files on disks local to the nodes, which would fill up sometimes).
Nearly all failures happened at the REST stage, and were usually due to a thread taking too long and being killed. I've increased the thread timeout to 90 s, and this seems to help.
The REST processes do grow to 1.5-2 GB in size, which is at or above the memory available per core on these machines.
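
The memory numbers above can be turned into a rough estimate of how many REST processes a node can hold at once. This is only a back-of-the-envelope sketch: the 8-core node is a hypothetical example (core counts are not given here), and it ignores memory used by the OS and by other stages.

```python
import math

def max_concurrent_jobs(cores, mem_per_core_gb, job_peak_gb):
    """Jobs that fit in node memory at peak size, capped at the core count."""
    total_mem_gb = cores * mem_per_core_gb
    return min(cores, math.floor(total_mem_gb / job_peak_gb))

# Hypothetical 8-core node; memory/core and peak sizes are the ranges quoted above.
for mem_per_core in (0.75, 1.5):
    for peak in (1.5, 2.0):
        n = max_concurrent_jobs(8, mem_per_core, peak)
        print(f"{mem_per_core} GB/core, {peak} GB peak: {n} of 8 cores usable")
```

Under these assumptions a 0.75 GB/core node can only keep 3-4 of its 8 cores busy without exceeding physical memory, which would be consistent with the resource starvation seen on some machines.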

The failures seem consistent with jobs either hitting some events that take very long to reconstruct or being resource-starved. I'm going to see what I can find out about the events on which the jobs die.
I'm also running jobs simulating 50K events to more closely reproduce Mark's results.

--Sdobbs 00:34, 10 February 2014 (EST)

Ran jobs overnight with 50k events each.
Killed ~1/3 of them after 13+ hours.
Success rate >40%, but lots of failures at the mcsmear level this time.
Jobs on new nodes were running fine, but those on older nodes were clearly resource-constrained (likely memory).
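
Rather than killing long-running jobs by hand, a wall-clock limit could be enforced by a small wrapper. The following is a minimal sketch, not what was actually used for these runs: `run_with_wall_limit` is a hypothetical helper, and the demo command is a stand-in for the real job script.

```python
import subprocess
import sys

def run_with_wall_limit(cmd, limit_s):
    """Run cmd; return its exit code, or None if it exceeded limit_s seconds.

    subprocess.run kills the child process itself when the timeout expires.
    Note that processes spawned by the child are not killed by this.
    """
    try:
        return subprocess.run(cmd, timeout=limit_s).returncode
    except subprocess.TimeoutExpired:
        return None

# Stand-in for a job step: a short Python command that exits cleanly.
status = run_with_wall_limit([sys.executable, "-c", "pass"], limit_s=30)
print("timed out" if status is None else f"exit code {status}")
```

For a real job the limit would be set in seconds to whatever wall time is considered hopeless (13 hours is 46800 s).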

--Sdobbs 14:36, 10 February 2014 (EST)

Ran jobs with 10K events each.
Success rate of 161/250 = 64%.
18 jobs hung at the beginning of REST generation and are still in the queue.
~10 jobs seemed to succeed but had truncated REST files, all from one node. There were also problems downloading the magnetic field, which may point to a disk space problem on the node.
Most of the others died at the beginning of REST file generation without any useful info.
A few other miscellaneous failures.
Going to run the same jobs under gdb to try to get some more useful info.
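
To judge whether changes in the success rate between runs are significant, it helps to attach a statistical uncertainty to the 161/250 figure. A quick sketch using the simple binomial error; it treats jobs as independent, which is only approximate here since some failures cluster on individual nodes.

```python
import math

def success_rate(n_ok, n_total):
    """Return (fraction, binomial standard error) for n_ok successes out of n_total."""
    p = n_ok / n_total
    err = math.sqrt(p * (1.0 - p) / n_total)
    return p, err

p, err = success_rate(161, 250)
print(f"{100 * p:.1f}% +/- {100 * err:.1f}%")  # prints "64.4% +/- 3.0%"
```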

--Sdobbs 15:15, 11 February 2014 (EST)

Two nodes hung in the previous run (the dreaded automounter hangs). Restarting 250 jobs with 10K events.

--Sdobbs 13:01, 12 February 2014 (EST)

17 jobs crashed due to problems downloading the magnetic field files. Will try running using CVMFS next time.
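
If the download failures are indeed caused by full local disks, a pre-flight check at the start of each job could catch the problem before any events are processed. A minimal sketch, assuming that free space on the scratch area is the relevant quantity; the 2 GB threshold and the use of the system temp directory are placeholders, not values from these runs.

```python
import shutil
import tempfile

def has_free_space(path, needed_bytes):
    """True if the filesystem containing path has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes

# Placeholder values: the system temp directory stands in for the node's
# local scratch disk, and 2 GB is an arbitrary example threshold.
scratch = tempfile.gettempdir()
needed = 2 * 1024**3
if has_free_space(scratch, needed):
    print(f"{scratch}: enough free space, OK to start job")
else:
    print(f"{scratch}: less than {needed} bytes free, skipping this node")
```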
