Difference between revisions of "HDDM Programmer's Interface"

Revision as of 08:47, 1 July 2016

Introduction

HDDM was introduced in the context of GlueX as a means to encode output from Monte Carlo simulations and results from their reconstruction. To understand why we needed something like HDDM, rather than going with a community-based standard such as HDF, see [1] below. That reference also contains a description of the design principles and requirements for the software package. The purpose of this wiki page is to give a quick-start guide for programmers who want to write new hddm files or read data from existing files. The package comes with a set of tools and programmer interfaces that make this very easy to do, particularly with Python. The underlying implementation is in C++, so it provides good performance in terms of data rate to/from disk files with serial access. On-the-fly compression/decompression and automatic integrity verification are built into the package. Random access to events at any location in a file without reading the entire file is also supported.

Templates and schemas

HDDM files are built from an xml template. A hddm template is a short xml document that describes the structure of one record in the hddm stream. Every hddm file has a copy of its template at the beginning, followed by its event data in a compact binary format. The template is what arranges those data into a meaningful structure. A simple example of a template is given below.

<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">
  <student name="string" minOccurs="0">
    <enrolled year="int" semester="int" maxOccurs="unbounded">
      <course credits="int" title="string" maxOccurs="unbounded">
        <result grade="string" Pass="boolean" />
      </course>
    </enrolled>
  </student>
</HDDM>

All of the events in the file represent repeats of this basic structure, with different values in the data fields. All actual data values are represented as attributes of tags. Attributes that are assigned type names ("string", "int", "long", "float", "double", "boolean", "anyURI", and "Particle_t") are user data. Any other values are treated as literal strings, and do not take up space in the file (other than in the template header). Some of these literal attributes function as metadata, e.g. you might want to add an attribute unit="GeV" to document the units used for other attributes in a tag. Others, like minOccurs and maxOccurs, tell the data model whether a given element must be present in every event or may be omitted (minOccurs="0" indicates this), or whether it may be repeated any number of times (maxOccurs="unbounded" indicates this). The top-level element is special in that it must always be named HDDM and have the attributes shown above. The class attribute is an abbreviation that you choose for the data model you are creating. Choose a short, unique name for your class because it is used in filenames that are written by the hddm tools, and the abbreviation prevents collisions between files built from different templates (classes).
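The distinction between user-data attributes and literal attributes can be illustrated with a short Python sketch. It is not part of the hddm toolkit: it only uses the standard xml.etree.ElementTree module to read the example template above and sort each tag's attributes by the rule stated in the text. The helper names local_name and classify_attributes are made up for this illustration, and the sketch assumes that tag names are unique within the template, as they are in this example.

```python
import xml.etree.ElementTree as ET

# The example template from this page (bytes, so the encoding declaration is honored).
TEMPLATE = b"""<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">
  <student name="string" minOccurs="0">
    <enrolled year="int" semester="int" maxOccurs="unbounded">
      <course credits="int" title="string" maxOccurs="unbounded">
        <result grade="string" Pass="boolean" />
      </course>
    </enrolled>
  </student>
</HDDM>
"""

# Type names listed in the text; attributes with one of these values hold user data.
TYPE_NAMES = {"string", "int", "long", "float", "double",
              "boolean", "anyURI", "Particle_t"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_attributes(root):
    """Map each tag name to a pair (user-data attributes, literal attributes)."""
    fields = {}
    for elem in root.iter():
        data = {k: v for k, v in elem.attrib.items() if v in TYPE_NAMES}
        literal = {k: v for k, v in elem.attrib.items() if v not in TYPE_NAMES}
        fields[local_name(elem.tag)] = (data, literal)
    return fields


fields = classify_attributes(ET.fromstring(TEMPLATE))
for tag, (data, literal) in fields.items():
    print(tag, "data:", data, "literal:", literal)
```

For the student tag, for instance, name lands among the user data while minOccurs is reported as a literal attribute.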

Templates provide an intuitive, informal way of specifying the structure of a record in a hddm file. For most users, this is all they need to know about, but for those familiar with XML there is a more formal way to specify the structure of an xml document, which is called an xml schema. HDDM uses schemas in two different ways. The first is to specify the structure of the templates themselves; the above template conforms to a schema called "http://www.gluex.org/hddm" (hint: this is not a URL to anywhere, it is a URI known as an 'xml namespace'), as indicated in the xmlns attribute in the HDDM tag of the template. The schema for this document type is found in hddm_schema.xsl in the main hddm directory of the distribution. The second use of schemas is that every hddm file is itself a valid xml document, so it needs a schema against which it can be verified. The hddm toolkit provides a pair of tools, hddm-schema and schema-hddm, that convert back and forth between templates and schemas. The two are equivalent ways of representing the same information about the structure of a hddm record, with the schema being more complete and standards-based, while the template is much shorter and more intuitive to most users. Schemas can express a much more general set of constraints on the data and the relationships between them, but experience has shown that their practical use for this purpose is very limited, except for specialists. For the remainder of this document, we will deal only with templates.

How to build

The hddm toolkit is distributed as part of the GlueX sim-recon distribution, which is available from the github repository JeffersonLab/sim-recon. Instructions for how to download and build sim-recon are given elsewhere on this wiki. The hddm tools are located in sim-recon/src/programs/Utilities/hddm. Checking out the repository, setting up your build environment, and executing "scons -u install" from sim-recon/src/programs/Utilities/hddm should be all that is needed to build the hddm toolkit. Before continuing to read this document, make sure that the basic tools like hddm-xml, xml-hddm, hddm-c, hddm-cpp, hddm-py, and xml-xml are in your shell PATH. These tools are not the hddm libraries themselves, but the tools you need to build the libraries from a template.

Before you can begin to work with hddm files, you need a template. There is a template at the head of every hddm file, so if you want to work with a hddm file that has already been created, simply extract the header using a text editor and save it to a file with the extension ".xml". Another way to get started would be to copy/paste the above example template into a file "exam2.xml" (or copy it from the distribution directory). The instructions that follow assume that you have done this. Now it is time to build a hddm i/o library to let you read and write hddm data records. Currently there are three programming languages supported by hddm: Python, C++, and C. Python is the least verbose and most readable interface, so let's begin with that.
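Extracting the header with a text editor can also be scripted. The following Python sketch is only an illustration and is not one of the hddm tools. It assumes that the template header is plain UTF-8 text that ends at the first occurrence of the closing tag "</HDDM>", which is how the example template above ends; the function name extract_template and its parameters are hypothetical. To keep the example self-contained, it builds a small fake file (a header followed by a few arbitrary bytes standing in for event data) rather than reading a real hddm file.

```python
import os
import tempfile


def extract_template(path, end_tag=b"</HDDM>", chunk_size=4096, max_header=1 << 20):
    """Return the text at the head of the file, up to and including end_tag.

    Assumption: the header is UTF-8 text and ends at the first end_tag.
    Reading stops after max_header bytes so that a file without a header
    is not loaded into memory in full.
    """
    buf = b""
    with open(path, "rb") as f:
        while len(buf) < max_header:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file reached without finding the closing tag
            buf += chunk
            pos = buf.find(end_tag)
            if pos != -1:
                return buf[:pos + len(end_tag)].decode("utf-8") + "\n"
    raise ValueError("no %s found near the start of %s" % (end_tag.decode(), path))


# Demonstration on a stand-in file: a tiny header plus arbitrary binary bytes.
header = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
          b'<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">\n'
          b'  <student name="string" minOccurs="0" />\n'
          b'</HDDM>\n')
with tempfile.TemporaryDirectory() as workdir:
    fake_file = os.path.join(workdir, "exam.hddm")
    with open(fake_file, "wb") as f:
        f.write(header + b"\x00\x01\xfe\xff not real event data")
    template_text = extract_template(fake_file)

print(template_text)
```

The returned text can then be written to a file with the extension ".xml" and passed to the hddm tools.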

HDDM in python

If you have access to a hddm file that was written by someone else, copy it into your work directory and use a text editor to extract the header into a file, which you may call "exam.xml". Use the following commands to build the python module that you will need to read the contents of this file.

$ hddm-cpp exam.xml
$ hddm-py exam.xml
$ python setup_hddm_X.py build -b build_hddm_X

In this example, I assumed that the HDDM "class" letter (see the HDDM tag in your template header) was "X". You should change it to whatever the actual class abbreviation is for your hddm file. The above steps should create a shared library that starts with hddm_X in your work directory. Copy that module to a place in your PYTHONPATH where you usually place your private python modules, then execute the following program to print the contents of your hddm file in plain text. I assumed it was called "exam.hddm".

import hddm_X

# iterate over the records in the file and print each one as plain text
for rec in hddm_X.istream("exam.hddm"):
    print(rec)

To see the same data printed out as a properly formatted xml document, replace "print(rec)" with "print(rec.toXML())".
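The class abbreviation that appears in the build commands above can be read from the template instead of being typed by hand. The Python sketch below is an illustration only: it parses the class attribute of the top-level HDDM tag and assembles the three command lines shown earlier. The function name build_commands is hypothetical, and the pattern setup_hddm_<class>.py is inferred from the single example on this page, so check the file names that hddm-py actually writes for your template. The commands are only printed, not executed.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_commands(template_path):
    """Return the build command lines for the template at template_path."""
    root = ET.parse(template_path).getroot()
    cls = root.get("class")  # class abbreviation from the HDDM tag
    if cls is None:
        raise ValueError("top-level element has no class attribute")
    name = os.path.basename(template_path)
    return [
        "hddm-cpp " + name,
        "hddm-py " + name,
        # file name pattern assumed from the example on this page
        "python setup_hddm_{0}.py build -b build_hddm_{0}".format(cls),
    ]


# Demonstration with a minimal template written to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "exam.xml")
    with open(path, "w") as f:
        f.write('<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm"/>\n')
    commands = build_commands(path)

for cmd in commands:
    print(cmd)
```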

HDDM in C++

HDDM in c

References

[1] HDDM "Hall D Data Model", R.T. Jones, Gluex-doc-065, September 23, 2003.
