Guide to roll-your-own python hddm transforms

From GlueXWiki
Revision as of 07:09, 25 July 2018 by Jonesrt (Talk | contribs) (example 3: stripping out bits of hit information)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

At one time, if you wanted to catenate two similar hddm files, you might do it with "hddmcat", or maybe with "hddmcp", or "hddmcp_c". Or if you have been in the collaboration long enough, you might remember our old friend "hddm_merge_files". Each of these utilities had a specific purpose at one time, no two of them are quite identical. There are others that do slightly different things, such as "hddm-xml", "xml-hddm", "hddm_cull_events", etc. The purpose of this page is to convince you that ALL of these are obsolete, now that the hddm python module is available. Using the python shell, you can perform a much more general set of actions on hddm input files, including merging, splitting based on event inspection, and general transforms of the event content. All of this is illustrated with examples below.

First, you need to make sure that your shell environment is properly configured for using the standard GlueX software stack. To verify this, try the following command at your linux shell prompt. If this command returns without an error, you are all set to go. If it complains about the hddm_s module not being found, you need to find the appropriate setup_env script for your login shell, source it and try again.

python -c 'import hddm_s'

example 0: listing hddm files

The first thing one would like to do is to browse the contents of a hddm file. Simply listing it with "more" shows the record structure of the data stored in the file, but not the actual values. There well-known "hddm-xml" utility can be used to render the contents of any hddm file in xml format. This utility is unique and will not be going away because it can read any hddm file, including types for which a python module does not yet exist. The hddm utilities package contains tools for creating the python module automatically from any hddm file, but if all you want to do is to list the contents of a file, you can use the hddm-xml utility without first building the python module. Nevertheless, using the python interface to render hddm files in ordinary text is still useful for a couple of reasons, (1) because it is usually the first step when doing anything else with the contents of a file, as a check that you are decoding it successfully, and (2) the python interface is an order of magnitude faster at converting hddm to xml than the hddm-xml utility. The first example below outputs the contents of a hddm_s file in xml, while the second renders it in something closer to a plain text listing format.

import hddm
for rec in hddm_s.istream("t1.hddm"):
  print rec.toXML()
import hddm
for rec in hddm_s.istream("t1.hddm"):
  print rec.toString()

example 1: merging files

In the first example, we learn how to catenate two hddm files named "t1.hddm" and "t2.hddm" into a single output file "tcat.hddm" with a single command.

python -c 'import hddm_s;fout=hddm_s.ostream("tcat.hddm");[[fout.write(rec) for rec in hddm_s.istream(fin)] for fin in ("t1.hddm","t2.hddm")]'

This is more readable if we save it to a file and write it on multiple lines, as follows.

import hddm_s
fout=hddm_s.ostream("tcat.hddm")
fout.compression = hddm_s.k_z_compression
for fin in ("t1.hddm","t2.hddm"):
  for rec in hddm_s.istream(fin):
    fout.write(rec)

In this example, I introduced one new feature, the ability to turn on compression in an output stream simply by assigning one of the predefined constants to the compression attribute. If you save the above program to a file x.py and execute it as "python x.py" then it will produce a compressed version of tcat.hddm that you created earlier with the one-liner above. Clearly I can extend the number of input files by adding to the list ("t1.hddm", "t2.hddm"). Alternately, I can specify a list of input files on the command line and replace "("t1.hddm", "t2.hddm") with sys.argv[1:]. You would also need to add "import sys" to the file header.

If you are a python user then you know that the command "pydoc hddm_s" gives you a ready-made man page-style documentation on all of the object types and functions provided by the hddm_s module. The "_s" in hddm_s stands for simulation hddm files. For working with REST files, you would use hddm_r. Other than the one-letter difference, the usage is same.

example 2: splitting files

Suppose you have generated a large bggen sample, and you want to split it into separate datasets depending on how many pi0's there are in each event. The following example does this by looking in the Monte Carlo record at the head of each event and counting the number of pi0's that it finds there. The particle types and their corresponding integer codes are listed in the header file particleType.h in the GlueX standard includes. If you change the "t1.hddm" to the name of your input hddm file and run it, it will write out separate files with names like "sample_1pi0.hddm", "sample_5pi0.hddm", etc., according to how many pi0s there were in the event. Note that if your Monte Carlo event file contains unstable particles that decay into pi0's then where an event ends after the split may depend on whether you run it on the file upstream or downstream from hdgeant/hdgeant4.

import hddm_s
pi0 = 7
fout = {}
for rec in hddm_s.istream("t1.hddm"):
  npi0=0
  for prod in rec.getProducts():
    if prod.type == pi0:
      npi0 += 1
  if not npi0 in fout:
    fout[npi0] = hddm_s.ostream("sample_{0}pi0.hddm".format(npi0))
  fout[npi0].write(rec)

example 3: stripping out bits of hit information

Suppose you created a simulation whose output contained some surprising features, and you want to simulate it again with different random numbers to check if what you saw was a statistical fluke. The following example can be used to strip out all of the simulated hits in the output file from hdgeant/hdgeant4, so that the exact same set of events can be fed back into the simulation as if they were coming from a Monte Carlo generator. This Monte-Carlo-only hddm file can then be fed either to hdgeant or hdgeant4 multiple times, to check for systematic differences, or dependency on the starting random number seed. Note that if you comment out the line below removing the <random> tag from the output event, then the simulation should repeat itself and generate identical results each time it is run. This provides a useful check on the repeatability of the simulation.

import hddm_s
fout = hddm_s.ostream("t0.hddm")
for rec in hddm_s.istream("t1.hddm"):
   rec.getPhysicsEvents()[0].deleteHitViews()
   # only do the following if you want to strip out all except the primary vertex
   if len(rec.getVertices()) > 1:
      rec.getReactions()[0].deleteVertices(-1,1) # everything (-1) starting at 1
   # only do the following if you want to simulate with completely fresh random seesds
   rec.getReactions()[0].deleteRandoms()
   fout.write(rec)

example 4: reordering your events

Suppose you are running a hdgeant4 simulation with multiple threads, and then you discover (as you will) that the events in the output file are not in the same order as they were in the input file. Most of the time that is not a problem, but it might be in special circumstances, eg. if you want to do a one-to-one comparison between results from hdgeant and hdgeant4. The following example illustrates how easy it is to write a python event sorter that can read the disordered file and write out an ordered file.

import hddm_s
fout = hddm_s.ostream("t1s.hddm")
fout.compression = hddm_s.k_z_compression #optional
events = {}
next_event = 1
for rec in hddm_s.istream("t1.hddm"):
   event = rec.getPhysicsEvent().eventNo
   if event == next_event:
      fout.write(rec)
      next_event += 1
      while next_event in events:
         fout.write(events.pop(next_event))
         next_event += 1
   else:
      events[event] = rec
print len(events), "leftover events"

example 5: writing Monte Carlo events

Suppose you have your own event generator that you want to use with hdgeant/hdgeant4, but the output from the generator is in plain text. This example provides a bare-bones starting point for writing your own converter to translate from whatever ascii format your generator uses to the hddm format that is recognized by hdgeant/hdgeant4. The example code in this section imagines a mythical event generator of a putative B boson that is photoproduced by a beam photon in the GlueX target and decays within a short time into a e+e- pair. A plain text description of an event from this mythical generator might be as follows

event 15438
beam photon energy: 14.3928 MeV
beam photon momentum: (-5.79431e-05,1.6068e-04,14.3928) MeV/c
beam photon polarization: (0.0477761,-0.062995,8.95607e-7)
target proton momentum: (0,0,0) MeV/c
B energy: 14.3928 MeV
B momentum: (1.519e-6,-1.33547,14.2379) MeV/c
B decay lifetime: 1.010 ns
B decay displacement: (-5.82e-1,4.00e-3,8.776) cm
B decay product 1: positron
B decay product 1 energy: 3.67671 MeV
B decay product 1 momentum: (-4.51987e-1,-2.30665e-1,3.6055) MeV/c
B decay product 2: electron
B decay product 2 energy: 10.7161 MeV
B decay product 2 momentum: (5.53506e-1,-1.10481,10.6324) MeV/c

The following python code illustrates how this input stream might be read in and translated into the hddm_s format and written out, to be later used as input by hdgeant/hdgeant4. This example code looks very long and complicated, but most of it is simply parsing the input text file. It should be easy to generalize this example to your specific ascii generator.

Photon = 1
Proton = 14
Positron = 2
Electron = 3
MeV = 0.001 #GeV
import hddm_s, fileinput, re
fout = hddm_s.ostream("tgen.hddm")
event = None
for line in fileinput.input():
  sline = line.rstrip().split()
  if sline[0] == "event":
    if event:
      fout.write(event)
    event = hddm_s.HDDM()
    pev = event.addPhysicsEvents(1)
    pev[0].runNo = 99999
    pev[0].eventNo = int(sline[1])
    rea = pev[0].addReactions(1)
  elif sline[0] == "beam" and sline[1] == "photon":
    if sline[2] == "energy:":
      bea = rea[0].addBeams(1)
      bea[0].type = Photon
      mom = bea[0].addMomenta(1)
      mom[0].E = float(sline[3]) * MeV
    elif sline[2] == "momentum:":
      mat = re.match(r"\((.*),(.*),(.*)\)", sline[3])
      mom[0].px = float(mat.group(1)) * MeV
      mom[0].py = float(mat.group(2)) * MeV
      mom[0].pz = float(mat.group(3)) * MeV
    elif sline[2] == "polarization:":
      pol = bea[0].addPolarizations(1)
      mat = re.match(r"\((.*),(.*),(.*)\)", sline[3])
      pol[0].Px = float(mat.group(1))
      pol[0].Py = float(mat.group(2))
      pol[0].Pz = float(mat.group(3))
  elif sline[0] == "target" and sline[1] == "proton":
    tar = rea[0].addTargets(1)
    tar[0].type = Proton
    if sline[2] == "momentum:":
      mom = tar[0].addMomenta(1)
      mat = re.match(r"\((.*),(.*),(.*)\)", sline[3])
      mom[0].E = 0.93827231
      mom[0].px = float(mat.group(1)) * MeV
      mom[0].py = float(mat.group(2)) * MeV
      mom[0].pz = float(mat.group(3)) * MeV
  elif sline[0] == "B":
    if sline[1] == "energy:":
      vtx = rea[0].addVertices(1)
      ori = vtx[0].addOrigins(1)
      pro = vtx[0].addProducts(1)
      pro[0].decayVertex = 0
      pro[0].id = 1
      pro[0].type = -12345
      mom = pro[0].addMomenta(1)
      mom[0].E = float(sline[2]) * MeV
    elif sline[1] == "momentum:":
      mat = re.match(r"\((.*),(.*),(.*)\)", sline[2])
      mom[0].px = float(mat.group(1)) * MeV
      mom[0].py = float(mat.group(2)) * MeV
      mom[0].pz = float(mat.group(3)) * MeV
    elif sline[1] == "decay":
      if sline[2] == "lifetime:":
        vtx = rea[0].addVertices(1)
        ori = vtx[0].addOrigins(1)
        ori[0].t = float(sline[3])
      elif sline[2] == "displacement:":
        mat = re.match(r"\((.*),(.*),(.*)\)", sline[3])
        ori[0].vx = float(mat.group(1))
        ori[0].vy = float(mat.group(2))
        ori[0].vz = float(mat.group(3))
      elif sline[2] == "product":
        mat = re.match(r"([0-9]+):", sline[3])
        if mat:
           id = int(mat.group(1))
           pro = vtx[0].addProducts(1)
           pro[0].decayVertex = 1
           pro[0].id = id + 1
           pro[0].mech = 0xdeadface
           pro[0].parentid = 1
           if sline[4] == "electron":
             pro[0].pdgtype = 11
             pro[0].type = Electron
           elif sline[4] == "positron":
             pro[0].pdgtype = -11
             pro[0].type = Positron
           elif sline[4] == "photon":
             pro[0].pdgtype = 22
             pro[0].type = Photon
        elif sline[4] == "energy:":
          mom = pro[0].addMomenta(1)
          mom[0].E = float(sline[5]) * MeV
        elif sline[4] == "momentum:":
          mat = re.match(r"\((.*),(.*),(.*)\)", sline[5])
          mom[0].px = float(mat.group(1)) * MeV
          mom[0].py = float(mat.group(2)) * MeV
          mom[0].pz = float(mat.group(3)) * MeV
if event:
  fout.write(event)

example 6: mining data, making histograms

Suppose you see what looks like a non-physical effect in the output from analysis of a Monte Carlo dataset, and you want to determine where the effect originates, whether in the event generator, the simulation, in mcsmear, or in the reconstruction libraries. This example shows how you can inspect the contents of the hddm events at each point along the Monte Carlo data path and produce plots that can yield clues as to the origin of the effect. This example illustrates how you might look at the timing spectra of the hits in the TOF counter relative to the time of the vertex.

import hddm_s
from ROOT import *
MeV = 1e3
def thist(name, hfile):
   h = TH1D(name, "", 200, 20, 40)
   h.GetXaxis().SetTitle("${\Delta (t_{hit} - t_{bunch})} (ns)")
   h.GetYaxis().SetTitle("hits")
   h.GetYaxis().SetTitleOffset(1.4)
   for rec in hddm_s.istream(hfile):
      tvtx = rec.getVertices()[0].getOrigins()[0].t
      for bar in rec.getFtofCounters():
         tleft = 1e9
         tright = 1e9
         for hit in bar.getFtofHits():
            if hit.end == 0:
               tleft = hit.t if hit.t < tleft else tleft
            else:
               tright = hit.t if hit.t < tright else tright
         tmean = (tleft + tright)/2
         if tmean < 1e3:
            h.Fill(tmean - tvtx)
   return h
ht5 = thist("ht5", "test5_smeared.hddm")
ht5.Draw()