Running jobs on the grid
This page documents recent work on getting jobs to run on the Open Science Grid, covering the steps taken and the issues encountered. A HowTo exists for getting started with running jobs on the grid; look there for instructions and examples of how to do what is described here.
One of the goals of using the Open Science Grid (OSG) is to have a large set of generic background Monte Carlo (MC) that can be used for various analyses. Ideally, this MC set can be regenerated easily whenever a major update occurs in the Hall D simulation and reconstruction code. The most efficient way to store this data right now is to keep only the reconstructed hddm files, which can be analyzed on the grid with the Hall D code or your own analysis code. The resulting ROOT files can then be downloaded to a local machine.
The following steps are general, but they are the ones used in the analysis of γ p → π+ π- π+ n.
We begin by building the Hall D source code and any analysis code in a designated space on the grid. In this particular instance, that space is /nfs/direct/apps/Gluex/pi-pi-pi-n. Note that bggen and any necessary plugins (danahddm) must be built in addition to the Hall D source code. This can be done using submission and executable files like those found in the HowTo.
Once the source code is built, we use an executable called run_sim.sh, which takes two arguments (the random number seed and the number of generated events), to run bggen, hdgeant and mcsmear. The resulting hdgeant_smeared.hddm file is sent to the grid storage space. This executable then hands off to the next one, run_ana.sh, which runs the hd_ana program with the danahddm plugin. No analysis is performed at this stage, but the reconstructed events are saved to a new file called dana_events.hddm, which is also sent to storage. This in turn hands off to a third executable, run_3pi.sh, which runs my specific analysis program and outputs a ROOT file. All unnecessary output is deleted.
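The three-stage chain described above can be sketched as a small Python driver. This is only an illustration, not how the scripts are actually wired (they call each other directly). The function names build_stages and run_chain are hypothetical, the "./" paths are assumed, and run_ana.sh and run_3pi.sh are shown without arguments because the text does not say what they take.

```python
import subprocess


def build_stages(seed, n_events):
    """Return the three stage commands for one job, in execution order."""
    return [
        # Stage 1: bggen, hdgeant and mcsmear (seed and event count as arguments)
        ["./run_sim.sh", str(seed), str(n_events)],
        # Stage 2: hd_ana with the danahddm plugin (arguments, if any, not known)
        ["./run_ana.sh"],
        # Stage 3: channel-specific analysis (arguments, if any, not known)
        ["./run_3pi.sh"],
    ]


def run_chain(stages):
    """Run the commands in order and stop at the first non-zero exit status.

    Returns (completed, failed): the list of commands that succeeded and the
    command that failed, or None if every stage succeeded.
    """
    completed = []
    for cmd in stages:
        status = subprocess.run(cmd).returncode
        if status != 0:
            return completed, cmd
        completed.append(cmd)
    return completed, None
```

Calling run_chain(build_stages(1, 50000)) would run one job's worth of stages. Stopping at the first failure keeps a crashed simulation step from feeding a truncated file into the later stages.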
Once all outstanding issues have been addressed, we would like to generate some realistic background samples. The total cross section for GlueX is about 120 µb. This channel has a cross section of about 3.2 µb (gluex-doc-856-v1). With a photon beam flux of 10^7/s and a target thickness of 1.26 b^-1, a bggen sample of 30 million events (signal size of 800,000 events) equates to a run time on the order of 10^4 s. We decided to run 600 jobs of 50,000 events each.
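The arithmetic behind these numbers can be checked with a few lines of Python. The sketch below uses only the values quoted above and takes the event rate as flux times the 1.26 b^-1 target factor times cross section. The variable names are illustrative, and numbering the seeds 1 to 600 is an assumption; any set of distinct seeds would do.

```python
# Values quoted in the text.
TOTAL_XSEC_B = 120e-6      # total cross section in barns (120 µb)
SIGNAL_XSEC_B = 3.2e-6     # cross section of this channel in barns (3.2 µb)
FLUX_PER_S = 1e7           # photon beam flux, photons per second
TARGET_PER_B = 1.26        # target factor in inverse barns
N_BGGEN = 30_000_000       # bggen sample size
N_JOBS = 600
EVENTS_PER_JOB = 50_000


def event_rate(xsec_b):
    """Events per second for a process with cross section xsec_b (in barns)."""
    return FLUX_PER_S * TARGET_PER_B * xsec_b


total_rate = event_rate(TOTAL_XSEC_B)               # about 1.5e3 events/s
beam_time_s = N_BGGEN / total_rate                  # about 2e4 s, i.e. order 10^4 s
n_signal = N_BGGEN * SIGNAL_XSEC_B / TOTAL_XSEC_B   # about 8e5 signal events

# One (seed, n_events) pair per job, matching the two arguments of run_sim.sh.
# Seeds 1..600 are an assumption made for this sketch.
jobs = [(seed, EVENTS_PER_JOB) for seed in range(1, N_JOBS + 1)]

print(f"total rate  : {total_rate:.0f} events/s")
print(f"beam time   : {beam_time_s:.0f} s")
print(f"signal size : {n_signal:.0f} events")
print(f"jobs        : {len(jobs)} x {EVENTS_PER_JOB} events")
```

The job list covers the full sample, since 600 × 50,000 = 30 million events.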
The following problems that we have encountered are still outstanding.
Crashes: We can get through mcsmear with some reliability (though not 100%), but we see crashes during reconstruction. The errors include "malloc(): memory corruption" and segmentation faults (in TVector3::Perp()). As a result, the dana_events.hddm files are incomplete (smaller than the corresponding hdgeant_smeared.hddm files).
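The size symptom described above suggests a quick way to flag suspect jobs after downloading their output. The Python sketch below is a heuristic only. It assumes one directory per job containing both files, which is a hypothetical layout, and it treats a missing file or a dana_events.hddm smaller than its hdgeant_smeared.hddm as a sign of a crash. The function names are illustrative.

```python
import os

SMEARED = "hdgeant_smeared.hddm"
RECO = "dana_events.hddm"


def looks_incomplete(job_dir):
    """Return True if the reconstruction output of one job looks truncated."""
    smeared = os.path.join(job_dir, SMEARED)
    reco = os.path.join(job_dir, RECO)
    # A missing file on either side means the job needs a closer look.
    if not (os.path.isfile(smeared) and os.path.isfile(reco)):
        return True
    # Heuristic from the observed crashes: incomplete reconstruction output
    # is smaller than the smeared input file.
    return os.path.getsize(reco) < os.path.getsize(smeared)


def find_incomplete_jobs(root):
    """Return the names of job directories under root that look incomplete."""
    flagged = []
    for name in sorted(os.listdir(root)):
        job_dir = os.path.join(root, name)
        if os.path.isdir(job_dir) and looks_incomplete(job_dir):
            flagged.append(name)
    return flagged
```

A job that passes this check is not guaranteed to be complete; the comparison only catches the failure mode seen so far.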