MantisBT - Hall D Offline
View Issue Details
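ID: 0000038
Project: Hall D Offline
Category: General
View Status: public
Date Submitted: 2011-01-07 23:10
Last Update: 2011-10-24 16:57
Reporter: davidl
Assigned To: staylor
Priority: normal
Severity: major
Reproducibility: sometimes
Status: resolved
Resolution: fixed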
0000038: Analysis program hangs on certain events
Jake Bennett reported this. It seems his analysis program hangs during the reconstruction of certain events. The whole JANA program doesn't hang, but the processing thread does and the program repeatedly reports that it "... hasn't reported in X seconds ...".
No tags attached.
Issue History
2011-01-07 23:10  davidl  New Issue
2011-01-07 23:11  davidl  Status: new => assigned
2011-01-07 23:11  davidl  Assigned To: => davidl
2011-01-24 09:32  davidl  Note Added: 0000053
2011-03-16 08:06  davidl  Note Added: 0000063
2011-03-17 08:33  davidl  Note Added: 0000064
2011-03-21 14:46  davidl  Note Added: 0000068
2011-03-21 14:46  davidl  Status: assigned => resolved
2011-03-21 14:46  davidl  Resolution: open => fixed
2011-10-18 11:03  davidl  Status: resolved => assigned
2011-10-18 11:03  davidl  Assigned To: davidl => pmatt
2011-10-18 21:58  pmatt  Note Added: 0000180
2011-10-18 21:58  pmatt  Assigned To: pmatt => staylor
2011-10-24 16:49  staylor  Note Added: 0000185
2011-10-24 16:57  staylor  Status: assigned => resolved

Notes
(0000053)
davidl   
2011-01-24 09:32   
This is an ongoing issue. The IU group has continued to look into this, as have I. The problem does seem to be related to a call to the STL sort algorithm in DTrackCandidate_factory_CDC.cc. This was identified by Matt S. and verified by David L. However, there is no evidence that the STL vector is corrupted going into the call, and a bug in sort itself seems unlikely.

The problem is currently being worked around in a couple of ways:

1. Compile DTrackCandidate_factory_CDC.cc with optimization turned off
2. Use a newer version of the code that uses the KalmanSIMD fitter rather than the ALT1 fitter.

I have also not ruled out that this has something to do with the JStreamLog mechanism in JANA. Since multiple threads use this, it could help explain the intermittent nature of the problem.
(0000063)
davidl   
2011-03-16 08:06   
What appears to be the same issue has been reported by Richard Jones at UConn. He had numerous GRID jobs hang with DTrackCandidate_factory_CDC appearing near the end of the stack trace before the segmentation fault. I was able to reproduce the error as he described it using bggen-produced events. A few things of note:

- Because the HDDM structure has changed since then, I had to re-run through hdgeant the set of "bad" events used previously to look at this problem. I was not able to get the hangs/crashes with the re-processed events.

- When running through a freshly generated set of bggen events, I was able to get it to reliably hang/crash after about 4k events by activating only the DTrackCandidate:CDC factory (and its inputs). I was able to cull out the single event causing the crash and can reliably crash or hang the program on it every time.

- At this point the problem with "sort" still seems to be there, in that I can see it run past the end of the vector.

It is still unclear, though, whether the sort code itself is being corrupted, is being poorly generated due to a bug in the optimizer, or the vector itself is being corrupted. Stepping through the assembly may be required.
(0000064)
davidl   
2011-03-17 08:33   
Richard Jones tracked down the source of this bug. See his full description here:
http://www.jlab.org/Hall-D/software/wiki/index.php/Diagnosing_segmentation_faults_in_reconstruction_software

The short version is that the assembly generated for our comparison routine (SortIntersections) copied one number out of the 80-bit floating point unit and then back in before comparing it to what should have been the identical number. The round trip shaved bits off one copy, so the two values no longer compared as equal. (The STL sort algorithm assumes that an object compared with itself is never "less than" itself.) This caused a while loop inside sort to run away, because it assumed at least one value in the array would not be less than the "pivot" value (a copy of one of the elements in the array). In other words, the bit shaving essentially led to A<A evaluating to true.
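
To make the failure mode concrete, here is a minimal sketch rather than the actual Hall D code or the fix that was committed. The names Intersection, perp2, lossyLess, safeLess, isIrreflexive and sortIntersections are all hypothetical. The rounding of one operand to float is only a portable stand-in for the 80-bit to 64-bit round trip described above. Forcing both operands through volatile doubles is one common defensive pattern, not necessarily what was done here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the objects being sorted.
struct Intersection {
    double perp2;
};

// Simulated faulty comparator: the left operand loses precision before
// the comparison, while the right operand keeps full precision. This
// mimics one value being spilled from a wide register and reloaded.
inline bool lossyLess(const Intersection& a, const Intersection& b) {
    double rounded = static_cast<double>(static_cast<float>(a.perp2));
    return rounded < b.perp2;
}

// Defensive comparator (an assumption, not the committed fix): both
// operands are stored to memory as 64-bit doubles before comparing, so
// they are always compared at the same precision.
inline bool safeLess(const Intersection& a, const Intersection& b) {
    volatile double lhs = a.perp2;
    volatile double rhs = b.perp2;
    return lhs < rhs;
}

// Returns true if cmp(x, x) is false for every element, which is one of
// the properties std::sort requires of its comparison function.
template <typename T, typename Cmp>
bool isIrreflexive(const std::vector<T>& v, Cmp cmp) {
    for (const T& x : v) {
        if (cmp(x, x)) return false;
    }
    return true;
}

// Sorts with the safe comparator, checking irreflexivity first in
// debug builds.
inline void sortIntersections(std::vector<Intersection>& v) {
    assert(isIrreflexive(v, safeLess));
    std::sort(v.begin(), v.end(), safeLess);
}
```

A value such as 1.0 + std::ldexp(1.0, -30) is exactly representable as a double but rounds to 1.0 as a float, so lossyLess reports it as less than itself, while safeLess does not. Passing a comparator like lossyLess to std::sort is undefined behavior, which is why the sketch only probes it with isIrreflexive and never sorts with it.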
(0000068)
davidl   
2011-03-21 14:46   
This issue was resolved a few days ago thanks to the heroic effort by Richard Jones to identify the source of the problem.
(0000180)
pmatt   
2011-10-18 21:58   
Stack Trace shows:
#6  0x00000000005b6ab6 in DTrackFitterKalmanSIMD::FitTrack() ()
(0000185)
staylor   
2011-10-24 16:49   
This particular issue was resolved in March...