NU DC2 Tests
From GlueXWiki
- I've been running jobs simulating 10K events using the same package versions as Mark described in the meeting on Friday.
- The machines have 0.75-1.5 GB of memory per core.
- There are no resource limits.
- The success rate is now >50%. The exact number is uncertain because I was staging some of the intermediate files on disks local to the nodes, and those disks sometimes filled up.
- Nearly all failures happened at the REST stage, and were usually due to a thread taking too long and being killed. I've increased the thread timeout to 90s, and this seems to help.
- The REST processes grow to 1.5-2 GB in size.
- The failures seem consistent with either hitting events that take very long to reconstruct or being resource-starved. I'm going to see what I can find out about the events on which the jobs die.
- I'm also running jobs simulating 50K events to more closely reproduce Mark's results.
--Sdobbs 00:34, 10 February 2014 (EST)
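Because some of the failures above came from node-local disks filling up during staging, a pre-flight check on free space could help separate those from real reconstruction failures. The following is only a minimal sketch, not part of the actual job scripts: the function name `has_free_space`, the staging path, and the 5 GB threshold are all assumptions made for illustration.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space."""
    return shutil.disk_usage(path).free >= required_bytes


def check_staging_area(path, required_bytes):
    """Return a short status string for a job log (hypothetical helper)."""
    if has_free_space(path, required_bytes):
        return "ok"
    return "insufficient disk space"


# Example values (assumptions, not taken from the tests above):
STAGING_DIR = "."           # stand-in for the node-local staging directory
MIN_FREE = 5 * 1024**3      # assumed 5 GB threshold

status = check_staging_area(STAGING_DIR, MIN_FREE)
```

A job wrapper could log `status` before staging starts and skip or requeue the job when the node is short on space, so that disk-related failures do not get counted together with failures in the REST stage.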
- Ran jobs overnight with 50K events each.
- Killed ~1/3 of them after 13+ hours.
- Success rate >40%, but lots of failures at the mcsmear stage this time.
- Jobs on the new nodes were running fine, but the older nodes were clearly resource-constrained (likely memory).
--Sdobbs 14:36, 10 February 2014 (EST)
- Ran jobs with 10K events each.
- Success rate of 161/250 = 64%.
- 18 jobs hung at the beginning of REST generation and are still in the queue.
- ~10 jobs seemed to succeed but had truncated REST files, all from one node. There was also a problem with downloading the magnetic field, so this may be a disk space problem on the node.
- Most of the others died at the beginning of REST file generation without any useful info.
- A few other miscellaneous failures.
- Going to run the same jobs under gdb to try to get some more useful info.
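The counts in this entry can be cross-checked with a short tally. This is a sketch for bookkeeping only: the variable names are illustrative, and the truncated-file count is taken as exactly 10 even though the entry only gives "~10", so the remainder is approximate.

```python
def success_rate(succeeded, total):
    """Return the success rate as a percentage."""
    return 100.0 * succeeded / total


total = 250
succeeded = 161
hung = 18        # hung at the beginning of REST generation
truncated = 10   # approximate ("~10"); counted separately from the 161 successes

# Jobs not covered by the categories above: those that died at the start
# of REST generation plus the few miscellaneous failures.
other_failures = total - succeeded - hung - truncated

print(f"success rate: {success_rate(succeeded, total):.1f}%")
print(f"other failures: ~{other_failures}")
```

Under these assumptions the rate comes out at 64.4%, consistent with the quoted 64%, and roughly 61 jobs fall into the last two categories.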