MantisBT

View Issue Details
ID: 0000038
Project: Hall D Offline
Category: General
View Status: public
Date Submitted: 2011-01-07 23:10
Last Update: 2011-10-24 16:57
Reporter: davidl
Assigned To: staylor
Priority: normal
Severity: major
Reproducibility: sometimes
Status: resolved
Resolution: fixed
Platform / OS / OS Version: (none given)
Summary: 0000038: Analysis program hangs on certain events
Description: Jake Bennett reported this. His analysis program appears to hang during the reconstruction of certain events. The JANA program as a whole doesn't hang, but the processing thread does, and the program repeatedly reports that it "... hasn't reported in X seconds ...".
Tags: No tags attached.
Attached Files: (none)

Notes
(0000053)
davidl (administrator)
2011-01-24 09:32

This is an ongoing issue. The IU group has continued to look into this, as have I. The problem does seem to be related to a call to the STL sort algorithm in DTrackCandidate_factory_CDC.cc. This was identified by Matt S. and verified by David L. However, there is no evidence that the STL vector is corrupted going into the call, and a bug in sort itself seems unlikely.

The problem is currently being worked around in a couple of ways:

1. Compile DTrackCandidate_factory_CDC.cc with optimization turned off
2. Use a newer version of the code that uses the KalmanSIMD fitter rather than the ALT1 fitter.

I have also not ruled out that this has something to do with the JStreamLog mechanism in JANA. Since multiple threads use this, it could help explain the intermittent nature of the problem.
(0000063)
davidl (administrator)
2011-03-16 08:06

What appears to be the same issue has been reported by Richard Jones at UConn. He had numerous GRID jobs hang with DTrackCandidate_factory_CDC appearing near the end of the stack trace before the segmentation fault. I was able to reproduce the error as he described it using bggen-produced events. A few things of note:

- I had to re-run the set of "bad" events used previously to look at this problem through hdgeant because the HDDM structure has changed since then. I was not able to get the hangs/crashes with the re-processed events.

- When running through a freshly generated set of bggen events, I was able to get it to reliably hang/crash after about 4k events by activating only the DTrackCandidate:CDC factory (and its inputs). I was able to cull out the single event causing the crash and can now reliably crash or hang the program on it every time.

- At this point the problem with "sort" still seems to be there, in that I can see it run past the end of the vector.

It is still unclear, though, whether the sort code itself is being corrupted, whether it is being poorly generated due to a bug in the optimizer, or whether the vector itself is being corrupted. Stepping through the assembly may be required.
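One inexpensive check that may help separate these hypotheses is to test the comparator against the basic requirements std::sort places on it, using the contents of the vector from the offending event. The sketch below is only an illustration: ComparatorLooksConsistent, StrictLess and LessOrEqual are hypothetical names, not part of the Hall D code, and the toy comparators work on plain doubles rather than the real objects. A comparator that misbehaves only in an optimized build may also behave differently once it is called from a helper like this, so a passing result does not clear it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical debugging helper (not from the Hall D source tree).
// Returns false if comp ever reports an element as less than itself, or
// reports both a<b and b<a for some pair. O(n^2), so debugging use only.
template <typename T, typename Compare>
bool ComparatorLooksConsistent(const std::vector<T>& v, Compare comp) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (comp(v[i], v[i])) return false;  // irreflexivity violated
        for (std::size_t j = i + 1; j < v.size(); ++j) {
            // asymmetry violated: each element claims to precede the other
            if (comp(v[i], v[j]) && comp(v[j], v[i])) return false;
        }
    }
    return true;
}

// Toy comparators for demonstration only.
inline bool StrictLess(const double& a, const double& b) { return a < b; }
// Deliberately invalid: an element compares "less than" itself.
inline bool LessOrEqual(const double& a, const double& b) { return a <= b; }
```

Calling the helper immediately before the sort, with the same comparator that is passed to sort, and logging a failure would point at the comparator rather than at the vector or at sort itself.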
(0000064)
davidl (administrator)
2011-03-17 08:33

Richard Jones tracked down the source of this bug. See his full description here:
http://www.jlab.org/Hall-D/software/wiki/index.php/Diagnosing_segmentation_faults_in_reconstruction_software

The short version is that the assembly generated for our comparison routine (SortIntersections) copied one number out of the 80-bit floating point unit and then back in before comparing it to what should have been the identical number, so the two values no longer matched exactly. The STL sort algorithm assumes that an object compared with itself will never be "less than" itself. The violation caused a while loop to run away, because the loop assumed that at least one value in the array would not be less than the "pivot" value (a copy of one of the elements in the array). In other words, the bit shaving essentially led to A<A evaluating to true.
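The runaway loop can be illustrated with a small sketch. This is not the STL implementation and not the SortIntersections code: ScanWhileLess, ExactLess and LessWithExcess are hypothetical names, and the 1e-9 offset is just a stand-in for the precision mismatch described above. The scan advances while elements compare "less than" the pivot; a bounds check has been added here so that the overrun is reported instead of reading past the end of the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mimics the inner scan of a quicksort partition step: advance while the
// current element is "less than" the pivot. An unguarded version of this loop
// has no bounds check and relies on some element not being less than the
// pivot; the check here exists only so the overrun can be observed safely.
template <typename T, typename Compare>
std::size_t ScanWhileLess(const std::vector<T>& v, const T& pivot, Compare comp) {
    std::size_t i = 0;
    while (i < v.size() && comp(v[i], pivot)) ++i;
    return i;  // i == v.size() means the scan never found a stopping element
}

// Well-behaved comparator: a value is never less than itself.
inline bool ExactLess(const double& a, const double& b) { return a < b; }

// Simulates the failure: the right-hand side carries a tiny excess, so even
// the element the pivot was copied from compares "less than" the pivot.
inline bool LessWithExcess(const double& a, const double& b) { return a < b + 1e-9; }
```

With v = {1.0, 2.0, 3.0} and the pivot copied from v[2], ExactLess stops the scan at index 2, while LessWithExcess lets it reach v.size(), which is the point where an unguarded loop would walk off the end of the array.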
(0000068)
davidl (administrator)
2011-03-21 14:46

This issue was resolved a few days ago thanks to the heroic effort by Richard Jones to identify the source of the problem.
(0000180)
pmatt (developer)
2011-10-18 21:58

Stack Trace shows:
#6 0x00000000005b6ab6 in DTrackFitterKalmanSIMD::FitTrack() ()
(0000185)
staylor (developer)
2011-10-24 16:49

This particular issue was resolved in March...

Issue History
Date Modified | Username | Field | Change
2011-01-07 23:10 | davidl | New Issue |
2011-01-07 23:11 | davidl | Status | new => assigned
2011-01-07 23:11 | davidl | Assigned To | => davidl
2011-01-24 09:32 | davidl | Note Added: 0000053 |
2011-03-16 08:06 | davidl | Note Added: 0000063 |
2011-03-17 08:33 | davidl | Note Added: 0000064 |
2011-03-21 14:46 | davidl | Note Added: 0000068 |
2011-03-21 14:46 | davidl | Status | assigned => resolved
2011-03-21 14:46 | davidl | Resolution | open => fixed
2011-10-18 11:03 | davidl | Status | resolved => assigned
2011-10-18 11:03 | davidl | Assigned To | davidl => pmatt
2011-10-18 21:58 | pmatt | Note Added: 0000180 |
2011-10-18 21:58 | pmatt | Assigned To | pmatt => staylor
2011-10-24 16:49 | staylor | Note Added: 0000185 |
2011-10-24 16:57 | staylor | Status | assigned => resolved

