HOWTO run jobs on the OSG using the GlueX singularity container

What is the GlueX singularity container?

The GlueX singularity container replicates your local working environment on the JLab CUE, including database files, executable binaries, libraries, and system packages, on the remote site where your job runs. Singularity is an implementation of the "Linux container" concept: it lets a user bundle up applications, libraries, and the entire system directory structure that describes how you work on one system, and move the whole thing as a unit to another host, where it can be started up as if it were running in its original context. In some ways this is similar to virtualization (e.g. VirtualBox), except that it does not suffer from the inefficiencies of virtualization. In terms of both computation speed and memory resources, processes running inside a Singularity container are just as efficient as if you had rebuilt and run them in the local OS environment.

How do I submit a job to run in the container?

Here is an example job submission script for the OSG that uses the GlueX singularity container maintained by Mark Ito on OSG network storage resources. These OSG storage resources are visible on the JLab machine scosg16.jlab.org at the mount point /cvmfs. You can see the path under /cvmfs to the GlueX singularity container on the +SingularityImage line in the submit script below. Change the name of the OSG proxy certificate on the x509userproxy line to point to your own proxy certificate, and make sure the proxy has several hours left on it before you submit your jobs with condor_submit on scosg16. The local directory osg.d (or whatever name you choose) should be created in your work directory, preferably under /osgpool/halld/userid, to receive the stdout and stderr logs from your jobs.

scosg16.jlab.org> cat my_osg_job.sub
executable = osg-container.sh
output = osg.d/stdout.$(PROCESS)
error = osg.d/stderr.$(PROCESS)
log = tpolsim_osg.log
notification = never
universe = vanilla
arguments = bash tpolsim_osg.bash $(PROCESS)
should_transfer_files = yes
x509userproxy = /tmp/x509up_u7896
transfer_input_files = tpolsim_osg.bash,setup_osg.sh,control.in0,control.in1,postconv.py,postsim.py
WhenToTransferOutput = ON_EXIT
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
on_exit_hold = false
RequestCPUs = 1
Requirements = HAS_SINGULARITY == True
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/markito3/gluex_docker_devel:latest"
+SingularityBindCVMFS = True
queue 10
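
Once the submit file and your input scripts are in place, a typical preparation and submission sequence on scosg16 might look something like the following sketch. The voms-proxy-info check, the /osgpool/halld/userid path, and the osg.d directory name are illustrative and should be adapted to your own account.

scosg16.jlab.org> voms-proxy-info -all          # confirm your proxy still has several hours of lifetime left
scosg16.jlab.org> cd /osgpool/halld/userid
scosg16.jlab.org> mkdir -p osg.d                # receives the stdout/stderr logs named in the submit file
scosg16.jlab.org> condor_submit my_osg_job.sub
scosg16.jlab.org> condor_q                      # monitor the progress of the queued jobs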

The script osg-container.sh starts up the container and makes sure that it has all of the environment settings configured for the /group/halld work environment. Fetch a copy of this script from the git repository https://github.com/rjones30/gluex-osg-jobscripts.git and customize it for your own work. This script is convenient because you can run it on scosg16 from your regular login shell, and it will give you the same environment that your job will see when it starts on a remote OSG site. For example, "./osg-container.sh bash" starts up a bash shell inside the container, where you can execute local GlueX commands as if you were on the ifarm, apart from access to the central data areas and /cache of course.
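
For example, a quick interactive test on scosg16 might look like the following (the clone location is arbitrary):

scosg16.jlab.org> git clone https://github.com/rjones30/gluex-osg-jobscripts.git
scosg16.jlab.org> cd gluex-osg-jobscripts
scosg16.jlab.org> ./osg-container.sh bash       # opens a bash shell inside the container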

How do I submit a job to run on OSG sites without singularity?

Singularity simplifies the writing and debugging of OSG job scripts by allowing them to run in an environment that closely mimics the JLab CUE. However, a significant fraction of the resources on the OSG do not have singularity installed, including a number of university clusters that provide opportunistic cycles to OSG users. As more OSG opportunistic jobs come to require singularity, an opportunity arises to gain access to these under-subscribed sites by submitting jobs that do not require it. This section explains how to take your existing OSG workflow based on the GlueX singularity container and allow it to run on sites without singularity, by making a couple of simple changes to the submit file and job script. The example below illustrates the minimal changes to the container-based submit script above that enable it to run on non-singularity OSG resources.

scosg16.jlab.org> cat my_nosg_job.sub
executable = osg-nocontainer.sh
output = osg0.d/stdout.$(PROCESS)
error = osg0.d/stderr.$(PROCESS)
log = tpolsim_osg.log
notification = never
universe = vanilla
arguments = bash tpolsim_osg.bash $(PROCESS)
should_transfer_files = yes
x509userproxy = /tmp/x509up_u7896
transfer_input_files = osg-nocontainer_2.29_jlab.env,osg-container-helper.sh,tpolsim_osg.bash,setup_osg.sh,control.in0,control.in1,postconv.py,postsim.py
WhenToTransferOutput = ON_EXIT
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
on_exit_hold = false
RequestCPUs = 1
Requirements = (HAS_CVMFS_oasis_opensciencegrid_org =?= True) && (HAS_CVMFS_singularity_opensciencegrid_org =?= True)
queue 10

Just a few minor changes are needed to the submit script to free the job from its dependence on singularity. First, the main job executable changes from osg-container.sh to osg-nocontainer.sh, both of which are delivered as part of the rjones30/gluex-osg-jobscripts package from GitHub. The second change is the addition of two new files to the transfer_input_files list. One of these is osg-container-helper.sh, which is also part of the rjones30/gluex-osg-jobscripts package. No modifications to this script should be needed, unless you want to change the version/release of the GlueX software that your jobs use, in which case you need to update the value of "version" assigned in the headers of all three scripts: osg-container.sh, osg-nocontainer.sh, and osg-container-helper.sh. The other new file is a custom environment script, named osg-nocontainer_2.29_jlab.env in the example above. You must generate this script yourself before launching the job. On any machine with singularity installed, you can generate it with the command "bash osg-container.sh make.env"; the special argument make.env tells osg-container.sh to build the custom .env script for the version assigned in its header. Normally you generate this .env script just once for a given version, and then use it for all of your OSG jobs until you need to update your GlueX software version. The last change to the submit script is the replacement of the HAS_SINGULARITY clause with the weaker requirement that the cvmfs filesystems for oasis and singularity are mounted on the worker node.
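
As a concrete illustration, the .env file used in the example above could be produced like this on scosg16, or on any other machine with singularity and /cvmfs available. The 2.29 in the output name simply reflects the version assigned in the osg-container.sh header and will differ if yours is set to another release.

scosg16.jlab.org> bash osg-container.sh make.env
scosg16.jlab.org> ls osg-nocontainer_*.env
osg-nocontainer_2.29_jlab.env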

In addition to the above changes to the submit file, one minor modification must be made to your own job execution scripts in order for them to run successfully outside the container. Each line in your scripts that executes a binary residing inside the GlueX container or the /group/halld area must be prefixed with $OSG_CONTAINER_HELPER. When run under osg-nocontainer.sh this environment variable expands to the ./osg-container-helper.sh wrapper script, whereas under osg-container.sh it expands to the null string, so the same script runs in either context without changes. For most binary executables, wrapping them with $OSG_CONTAINER_HELPER is sufficient to make them run outside singularity as if they were running inside the container. The root executable is an exception: if you want to run root in batch mode inside your scripts, invoke it as $OSG_CONTAINER_HELPER root.exe rather than simply root.
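
The fragment below sketches how this looks inside a job script such as tpolsim_osg.bash. The particular executable (hd_root) and macro name (mymacro.C) are only placeholders for whatever your own script actually runs.

# before: runs only when the job is inside the singularity container
hd_root input.hddm

# after: runs in either context, inside the container or on a bare OSG worker node
$OSG_CONTAINER_HELPER hd_root input.hddm

# ROOT in batch mode: invoke root.exe rather than the root wrapper
$OSG_CONTAINER_HELPER root.exe -b -q mymacro.C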

How do I submit a job to run on OSG sites without /cvmfs?

This is also possible, but it requires some manual operations to be carried out on the head node of the site before the jobs are submitted. If you have access to a resource that would be useful for GlueX but does not have /cvmfs installed, write me an email and I will work with you to set it up.