|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000038||Hall D Offline||General||public||2011-01-07 23:10||2011-10-24 16:57|
|Summary||0000038: Analysis program hangs on certain events|
|Description||Jake Bennett reported this. It seems his analysis program hangs during the reconstruction of certain events. The whole JANA program doesn't hang, but the processing thread does and the program repeatedly reports that it "... hasn't reported in X seconds ...".|
|Tags||No tags attached.|
This is an ongoing issue. Both the IU group and I have continued to look into it. The problem does seem to be related to a call to the STL sort algorithm in DTrackCandidate_factory_CDC.cc. This was identified by Matt S. and verified by David L. However, there is no evidence that the STL vector is corrupted going into the call, and a bug in sort itself seems unlikely.
The problem is currently being worked around in a couple of ways:
1. Compile DTrackCandidate_factory_CDC.cc with optimization turned off
2. Use a newer version of the code that uses the KalmanSIMD fitter rather than the ALT1 fitter.
I have also not ruled out that this has something to do with the JStreamLog mechanism in JANA. Since multiple threads use this, it could help explain the intermittent nature of the problem.
What appears to be the same issue has been reported by Richard Jones at UConn. He had numerous GRID jobs hang with DTrackCandidate_factory_CDC appearing near the end of the stack trace before the seg. fault. I was able to reproduce the error as he described it using bggen-produced events. A few things of note:
- Because the HDDM structure has changed since then, I had to re-run the set of "bad" events previously used to study this problem through hdgeant. I was not able to get the hangs/crashes with the re-processed events.
- When running through a freshly generated set of bggen events, I was able to get it to reliably hang/crash after about 4k events by activating only the DTrackCandidate:CDC factory (and its inputs). I was able to cull out the single event causing the crash and can reliably crash or hang the program on it every time.
- At this point the problem with "sort" still seems to be there, in that I can see it run past the end of the vector.
It is still unclear, though, whether the sort code itself is being corrupted, is being poorly generated due to a bug in the optimizer, or whether the vector itself is being corrupted. Stepping through the asm may be required.
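One cheap check before stepping through the asm: std::sort has undefined behavior, including reading past the end of the range, when its comparator is not a strict weak ordering. The sketch below is only a debugging aid, not code from the repository. The helper name comparator_is_consistent and the two toy comparators are hypothetical. It tests irreflexivity and asymmetry only (not transitivity), and a pass does not clear the comparator, because the optimizer may generate different code for the check than for the call inside sort.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical debugging helper: returns true if cmp is
//   irreflexive: cmp(a, a) is false for every element, and
//   asymmetric:  cmp(a, b) and cmp(b, a) are never both true.
// Uses O(n^2) comparisons, so it is meant for a single culled event only.
template <typename T, typename Cmp>
bool comparator_is_consistent(const std::vector<T>& v, Cmp cmp) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (cmp(v[i], v[i])) return false;            // element "less than" itself
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            if (cmp(v[i], v[j]) && cmp(v[j], v[i])) {  // each "less than" the other
                return false;
            }
        }
    }
    return true;
}

// Toy comparators, used only to show what the check accepts and rejects.
inline bool strictly_less(double a, double b) { return a < b; }
inline bool less_or_equal(double a, double b) { return a <= b; }  // not irreflexive
```

Calling the helper on the offending vector with the real comparison routine, immediately before the sort call, would show whether the comparator misbehaves on that input.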
Richard Jones tracked down the source of this bug. See his full description here:
The short version is that the assembly generated for our comparison routine (SortIntersections) copied one number out of the 80-bit floating-point unit and then back in before comparing it to what should have been the identical number. The round trip shaved off bits, which essentially led to A<A evaluating to true. The STL sort algorithm assumes that an object compared with itself will never be "less than" itself. With that assumption broken, a while loop ran away, because it relies on at least one value in the array not being less than the "pivot" value (a copy of one of the elements in the array).
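A minimal sketch of one common defense against this class of problem follows. It is not necessarily the fix that was committed. The Intersection struct, its fields, and the names stored_r2 and SortIntersectionsSafe are hypothetical stand-ins. The idea is to force each sort key through a 64-bit double in memory before comparing, so both operands have been rounded the same way regardless of what the optimizer keeps in the 80-bit registers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the objects being sorted.
struct Intersection {
    double x;
    double y;
};

// Compute the sort key and push it through memory. Writing to a volatile
// double forces a 64-bit store, which discards any extra x87 precision.
inline double stored_r2(const Intersection& p) {
    volatile double r2 = p.x * p.x + p.y * p.y;
    return r2;
}

// Comparator that only ever compares keys rounded to 64 bits, so comparing
// an element with a copy of itself cannot report "less than".
inline bool SortIntersectionsSafe(const Intersection& a, const Intersection& b) {
    const double ra = stored_r2(a);
    const double rb = stored_r2(b);
    return ra < rb;
}
```

It would be passed to the sort in the usual way, e.g. std::sort(v.begin(), v.end(), SortIntersectionsSafe). Another option is to have the compiler use SSE arithmetic instead of the x87 unit (for GCC, -mfpmath=sse together with -msse2), which avoids the 80-bit intermediates altogether.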
|This issue was resolved a few days ago thanks to the heroic effort by Richard Jones to identify the source of the problem.|
Stack Trace shows:
0000006 0x00000000005b6ab6 in DTrackFitterKalmanSIMD::FitTrack() ()
|This particular issue was resolved in March...|
|2011-01-07 23:10||davidl||New Issue|
|2011-01-07 23:11||davidl||Status||new => assigned|
|2011-01-07 23:11||davidl||Assigned To||=> davidl|
|2011-01-24 09:32||davidl||Note Added: 0000053|
|2011-03-16 08:06||davidl||Note Added: 0000063|
|2011-03-17 08:33||davidl||Note Added: 0000064|
|2011-03-21 14:46||davidl||Note Added: 0000068|
|2011-03-21 14:46||davidl||Status||assigned => resolved|
|2011-03-21 14:46||davidl||Resolution||open => fixed|
|2011-10-18 11:03||davidl||Status||resolved => assigned|
|2011-10-18 11:03||davidl||Assigned To||davidl => pmatt|
|2011-10-18 21:58||pmatt||Note Added: 0000180|
|2011-10-18 21:58||pmatt||Assigned To||pmatt => staylor|
|2011-10-24 16:49||staylor||Note Added: 0000185|
|2011-10-24 16:57||staylor||Status||assigned => resolved|