Investigations and Repairs

From GlueXWiki
Revision as of 17:36, 8 February 2017 by Jzarling (Talk | contribs) (FCAL Replacement Base Criteria)


This page is intended to be a reference for past and ongoing issues regarding base failures, investigations, and repairs. Dates are provided on some bullet points for extra clarity. An inventory of bases can be found at https://halldweb.jlab.org/JInventory/htdocs/list.php#. Contact Jon Zarling or Adesh Subedi for instructions on navigating JInventory.

Google Spreadsheet

HV Issues

  • See https://logbooks.jlab.org/book/hdfcal for a full list of base replacements.
  • Bases continue to experience HV failure at a rate of ~1 new base per day (4/11/16).
  • High temperature (in the ballpark of 35-40 C) is correlated with base HV failure. Which is the cause and which is the effect is still not entirely clear at this point (4/11/16).
  • High current, with HV sagging by ~30 V, is another common observation. These bases can continue to operate for months, though it is believed (at least by Adesh and Jon) that they will fail at some point. Unfortunately, to the best of Jon's awareness, no monitoring data exists that captures the moment of HV failure.
    • ONGOING: A setup in CEBAF can at present externally heat up to 12 bases to ~35 C. Currently it can only run during daytime hours. Over three eight-hour periods, no HV failures were seen among 12 bases recently removed from the FCAL. Jon intends to investigate for a few days at 40 C before drawing conclusions.
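
If per-base readings were logged, the two warning signs above (temperature around 35-40 C and HV sagging by ~30 V) could be screened for with a short script. The following is only a sketch under stated assumptions: the BaseReading record, its field names, and the way readings are obtained are hypothetical and are not taken from the EPICS or IU software. The thresholds are the ballpark figures quoted above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds from the ballpark figures on this page; adjust as needed.
TEMP_WARN_C = 35.0     # lower edge of the 35-40 C region
HV_SAG_WARN_V = 30.0   # typical sag seen on high-current bases


@dataclass
class BaseReading:
    """One hypothetical monitoring sample for a single base."""
    base_id: int
    hv_setpoint_v: float
    hv_measured_v: float
    temperature_c: float


def warning_flags(reading: BaseReading) -> List[str]:
    """Return the warning signs shown by one reading (may be empty)."""
    flags = []
    if reading.temperature_c >= TEMP_WARN_C:
        flags.append("hot")
    # Compare magnitudes so the check works for either HV polarity.
    sag = abs(reading.hv_setpoint_v) - abs(reading.hv_measured_v)
    if sag >= HV_SAG_WARN_V:
        flags.append("hv_sag")
    return flags


def flag_bases(readings: List[BaseReading]) -> Dict[int, List[str]]:
    """Map base ID to its warning flags, keeping only flagged bases."""
    flagged = {}
    for reading in readings:
        flags = warning_flags(reading)
        if flags:
            flagged[reading.base_id] = flags
    return flagged
```

Running something like this periodically and storing the result with a timestamp would give a record of which bases were hot or sagging before they failed, which is the monitoring data noted above as missing.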

Communications Issues

  • From February to April 8, 2016, nine bases on the FCAL had severe communication issues with both the EPICS and IU software. Adesh attempted to address them both individually and together with all other bases on the strand. They gave no response for a matter of days, and power cycling via software controls did not help. However, once removed from the FCAL, these bases appear to respond just fine (no long-term monitoring has been performed yet) (4/11/16).
  • Bases that lose communication appear to be unable to receive messages, and work again for some time after being physically reset. Three bases with communication issues could not turn on their LEDs. Once physically unplugged and plugged in again, they were able to turn on their LEDs properly (4/12/16).
  • At present, all that is known is that, of three bases in the FCAL with persistent communication issues, one stopped communicating again within a week (4/18/16).
  • Two bases in the FCAL lost all communication and could not be recovered by any amount of power cycling. They also gave no response over SWIM cables. A bad COMM board is suspected. It is not clear whether this is related to the intermittent issues.

Remote Resetting

  • It has become clear that resetting bases, or powering them on or off, via the I/O port is NOT the same as physically unplugging them and plugging them back in. Previously we had thought, and been told, that the two methods should be completely equivalent. However, a number of things behave differently under the two methods. One example is that bases with communication issues show no change in behavior after a remote reset, but become responsive after a physical reset. This has not yet been explored thoroughly.

Incorrect SOCK/tran pin Issues

  • It was discovered that plugging the bottom row of the tran board into the top row of the sock board (leaving the top row of tran pins unconnected) led to significant issues (4/11/16).
    • This caused the back plate to become charged to beyond (-)60 V, presenting a nasty shock hazard.
    • Bases that build up charge can also discharge to a neighboring base's back plate. This in turn can cause some nearby bases to reset (presumably due to RF noise).
    • These resetting bases caused base monitoring to become very difficult, particularly from late February to April 11 at JLab.
    • Some bases may be more prone to this issue than others, as the coax cable can sometimes guide the tran board to the incorrect sock pin slots.
  • At least one base was found to have this issue even with pins plugged in correctly. A bad sock or tran board may explain this.

Bootloader Issues

  • There were a number of issues when attempting to reprogram bases via the bootloader. No issues have been observed while reprogramming via SWIM cables. The bootloader scheme of reprogramming bases was devised by Dan Bennett, with some advice from one of the seller companies.
  • Sketch of the bootloader checking procedure: the normal operating firmware is restricted to a certain address range. Other firmware exists for bootloader operation, but it can only be altered when reprogramming via SWIM cables (in theory).
    • The firmware is sent in CAN messages. Each message contains data, the address to reprogram, and a checksum of the message. The base sums the message locally and sends a response byte to indicate whether the local and server checksums match.
  • In practice, the checksum scheme is insufficient: a large number of issues occur. In some cases, one base can corrupt ALL other bases on a strand. Studies done last summer determined that any base will eventually cause a failure if the upload procedure is performed enough times, and that this can lead to corruption of other bases. Somehow addresses outside the correct range are altered, and somehow bases mimic the server ID. Neither is well understood; one would think both should be astronomically unlikely if the errors were truly random.
  • Jon estimates two weeks of labor to upload a new firmware to the FCAL.
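
The checksum handshake described above can be sketched as follows. This is an illustration only, not the actual bootloader code. The message layout (2-byte address, up to 5 data bytes, 1 checksum byte, chosen to fit the 8-byte payload of a classic CAN frame), the modulo-256 byte sum, the ACK/NAK values, and the address window are all assumptions made for this example. The address-range check on the base side is likewise a hypothetical guard and is not known to exist in the real firmware.

```python
# All constants below are hypothetical values chosen for illustration.
ACK = 0x01          # response byte: checksums match
NAK = 0x00          # response byte: message rejected
APP_START = 0x9000  # assumed start of the normal-firmware address window
APP_END = 0xFFFF    # assumed end of the normal-firmware address window
MAX_DATA = 5        # 8-byte CAN payload minus 2 address bytes and 1 checksum


def additive_checksum(payload: bytes) -> int:
    """Sum all bytes modulo 256 (an assumed, simple checksum)."""
    return sum(payload) & 0xFF


def build_message(address: int, data: bytes) -> bytes:
    """Server side: address, data bytes, then one checksum byte."""
    if len(data) > MAX_DATA:
        raise ValueError("data does not fit in one CAN frame")
    body = address.to_bytes(2, "big") + data
    return body + bytes([additive_checksum(body)])


def handle_message(message: bytes) -> int:
    """Base side: recompute the sum and answer with one response byte."""
    if len(message) < 3:
        return NAK
    body, received = message[:-1], message[-1]
    if additive_checksum(body) != received:
        return NAK
    address = int.from_bytes(body[:2], "big")
    n_data = len(body) - 2
    # Hypothetical extra guard: refuse writes outside the firmware window.
    if not (APP_START <= address and address + n_data - 1 <= APP_END):
        return NAK
    return ACK
```

One property of a plain byte sum is worth noting: it does not depend on byte order, so a message whose data bytes have been swapped still passes the check, and a change in an address byte can be cancelled by an opposite change in a data byte. This does not explain the corruption seen on the strands, but it shows that a sum-only check leaves some classes of error undetected, which is why a range check or a stronger check such as a CRC is often added.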

FCAL Replacement Base Criteria (updated 2/8/17)

  • Before adding to stack of replacement bases in F117:
    • JInventory double checked, MULT board history updated
    • Checked for pulses in dark box

IU Shipment Processing

  • This is mostly a personal reference for Jon
  • Bases are checked for:
    • Proper base ID and CAN ID in JInventory
    • Ability to communicate over CAN line
    • Proper back plate voltage (a quick test to make sure the tran and sock boards are properly plugged in)