HOWTO get your jobs to run on the Grid

Note on this page

The contents of this page were last updated in 2010. As of May 2014, the contents of Using the Grid supersede this page.

Introduction

What I'm going to outline below is how I got my jobs to run on the Grid and what I needed to do it. I'll try to include documentation where I can and fixes for other OSes. My OSG client machine is a Debian Lenny distro here at the University of Regina. It was a random machine I had available but the OSG software works fine on it. (It didn't on my Mandriva 2010 distro on my desktop.) This was a month's worth of trial and error fixing bugs and firewall issues on the Grid with Richard Jones, but we have most of them worked out and everything has been running fine for the last week or so.

Updates to Blake's HOWTO are being made by Jake.

Step 1: Getting your Grid Certificate

The security for the grid is quite robust and as such, a signed certificate from a known signing authority is needed. As I am working in Canada, I used Westgrid to get my certificate from Grid Canada under a project already registered at the UofR. You will require a sponsor who will verify that you are part of their project if you are not the project leader. It took 2-3 weeks to get my certificate since people at Westgrid were on holidays at the time. Normally it should take a few days. REMEMBER THE PASSWORD YOU SUBMITTED!

OSG user certificates are obtained through the CIlogon CA, operated by NCSA on behalf of the InCommon Federation. You can scan the CIlogon FAQ for instructions on how to request a certificate for use on the OSG.

The instructions on the page noted above involve completing the steps in your browser. It is possible to complete everything from a unix shell and command line. The only difference will be in the order of the steps, but the final result will be the same.

As noted in the instructions on the CIlogon FAQ page, you must download your certificate with the same browser, as the same user, and on the same machine from which you requested the certificate. You will also need to download the CA certificate before you will be able to download your personal certificate. Following the steps on this page, you will end up with a .p12 file from which you will extract the cert.pem and key.pem files that are mentioned below. Once these files are created, you will no longer need the .p12 file and should remove it. When you import your certificate, you should take note of the dates of validity. It is useful to include this information in the name of your final .p12 file (not the one I suggested removing above!), for example uname-osg-7-2017.p12.
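
A minimal OpenSSL sketch of that extraction (the input file name cilogon.p12 is only an example of whatever your browser exported; the -nodes option leaves the extracted key unencrypted, matching the cert.pem/key.pem pair described below, so protect the files accordingly):

bash$ openssl pkcs12 -in cilogon.p12 -clcerts -nokeys -out cert.pem
bash$ openssl pkcs12 -in cilogon.p12 -nocerts -nodes -out key.pem
bash$ openssl x509 -in cert.pem -noout -dates

The last command prints the validity dates mentioned above, which you can fold into the name of your final .p12 file.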

When my Westgrid account was finally set up, I was given a key pair (two files, the certificate and the private key for that certificate, cert.pem and key.pem). Keep these safe in a place no one can access them, as they are not encrypted. For encryption security and use on the Grid I converted these into a PKCS12 file (usercred.p12) on my client machine using OpenSSL (OpenSSL must be installed on the client; it is generally a distro package, e.g. "apt-get install openssl"). "bash$" indicates the shell prompt. The following command will convert your certificate and key to a PKCS12 file:

 bash$ openssl pkcs12 -export -in cert.pem -inkey key.pem -out usercred.p12

You will be prompted for an export password. This is the password you provided when you applied for or created your certificate.

In your home directory on your client create a directory called ".globus" and move your usercred.p12 file there.

bash$ mkdir -p ~/.globus
bash$ mv usercred.p12 ~/.globus/.

You will need to change the permissions on the files in your .globus directory to user only in order to generate a proxy.

bash$ chmod u=rw,go= ~/.globus/*
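
As a quick sanity check (my suggestion, not part of the original recipe), you can confirm the permissions and that OpenSSL can read the converted file; the second command prompts for the import password:

bash$ ls -l ~/.globus
bash$ openssl pkcs12 -info -in ~/.globus/usercred.p12 -noout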

Step 2: Registering for the Gluex VO

To access the GlueX VO page, you must install your security certificate in your browser. It'll reject you otherwise.

For Firefox: Under Preferences > Advanced > Encryption > View Certificates > Import you can add your newly created personal certificate.

Now go to the Gluex VO Registration page and register as a user. I selected the simulation, production and software roles. There are a couple phases with approval processes in between. This will take a day or so to complete. Further information on the registration process can be found in the VOM Registration Service User and Admin Guide.

Step 3: Installing OSG Client software

I installed the software as root, though the OSG instructions imply that this can be done for a single user. I am also using the bash shell. Richard Jones installed it under his cue account at Jefferson Lab, under the directory ~jonesrt/osg-client. Total installed size was 1.0GB, probably a substantial part of your user quota, so installing as root is recommended whenever possible.

To install as root, change to super user ("bash$" indicates a normal shell prompt, "bash#" a root prompt). Create a directory where the software will be installed:

bash$ su
bash# mkdir -p /usr/local/osg
bash# cd /usr/local/osg

Pacman Install

To install the OSG Client software, you will require the installer Pacman. I followed the instructions on the Open Science Grid Pacman Install site. Be sure to follow the instructions for OSG 1.2 only. (The Pacman install did not work with my Mandriva distro, which is why I switched to the Debian machine.)

bash# wget http://atlas.bu.edu/~youssef/pacman/sample_cache/tarballs/pacman-3.28.tar.gz
bash# tar --no-same-owner -xzvf pacman-3.28.tar.gz
bash# cd pacman-3.28

For sh and bash shells:

bash# source setup.sh

For csh and tcsh shells:

tcsh# source setup.csh

Installing the OSG client

I am following the OSG 1.2 instructions from here.

bash# cd /usr/local/osg
bash# pacman -get http://software.grid.iu.edu/osg-1.2:client
 Do you want to add http://software.grid.iu.edu/osg-1.2 to trusted.caches? (y/n/yall):  yall

You may be prompted with other questions. Follow the example here.

Now you have to install the CA certificates.

bash# . ./setup.sh
bash# vdt-ca-manage setupca --location local --url osg

Finally, you must turn on the services that you want to run on the client. Normally the cron job that updates the CA certificates from the OSG repository (vdt-update-certs), the cron job that updates the CRLs (fetch-crl), and the condor client (condor) should be enabled.

bash# vdt-control --list
bash# vdt-control --enable vdt-update-certs
bash# vdt-control --enable fetch-crl
bash# vdt-control --enable condor
bash# vdt-control --on condor

Now everything should be set up. Let's check:

bash# source /usr/local/osg/setup.sh
bash# vdt-version

Everything but "CA Certificates" should have an "OK" beside it. I added the "source setup.sh" line to the bash profile so I would not have to manually source it every time. There are instructions for testing the OSG client software here: Validate Clients.

Setting up the OSG client

I've edited the system wide bash profile ("/etc/profile" in Debian) to source the OSG setup.sh:

VDT_LOCATION=/usr/local/osg
export VDT_LOCATION
if [ -r ${VDT_LOCATION}/setup.sh ]; then
. ${VDT_LOCATION}/setup.sh
fi

You should replace /usr/local/osg with the directory in which you performed the original pacman -get command above.

You must define your client's IP address or the grid will report an error that it doesn't know where the client is. Because there are firewalls involved, we must also define some port ranges. These are the current ports set for the Gluex VOMS. I have edited /usr/local/osg/vdt/etc/vdt-local-setup.sh and put these there.

bash# emacs  $VDT_LOCATION/vdt/etc/vdt-local-setup.sh

Your vdt-local-setup.sh will look like:

   # This file is sourced by setup.sh.  Use it for any custom setup for this site.
   # This file will be preserved across VDT installations if OLD_VDT_LOCATION is set.
   export GLOBUS_HOSTNAME=your.ip.add.ress
   export GLOBUS_TCP_PORT_RANGE=45000,49999
   export GLOBUS_TCP_SOURCE_RANGE=45000,49999


You must also specify the port range for use by Condor-G in the $CONDOR_CONFIG file (/usr/local/osg/condor/etc/condor_config). I put this at the end of Part 1 of that file. (Note: condor must be restarted.)

HIGHPORT = 49999
LOWPORT = 45000 


Then restart condor

bash# /etc/init.d/condor stop
bash# /etc/init.d/condor start

Note: if your client machine is also behind a firewall, you must open those ports and a few others. See the OSG Firewall Documentation for help.

Step 4: Running Test Jobs and Practice

Now, to see if it all actually worked... or what didn't. You can exit super-user and perform tasks as a normal Linux user now.

Now that the OSG Client has been installed, you will have to configure it with your grid user certificate in order to start submitting jobs. Make sure you are logged in as a normal user and we will setup a proxy certificate:

bash$ voms-proxy-init -hours 24 -cert ~/.globus/usercred.p12 -voms Gluex

If you have installed your certificate in ~/.globus/usercred.p12 then the -cert option is not required. This command will generate a proxy certificate that is valid for 24 hours. Problems arise if the proxy expires before jobs complete, so be sure to make this long enough. Use -help with the proxy-init commands to find out other options. You will be prompted to enter your certificate password. The proxy can be renewed at any time (note that a job already submitted to condor keeps the proxy it was submitted with and will not pick up the renewed one).
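
Before submitting a long batch it can help to check how much lifetime the current proxy has left, and to renew it if needed. A small sketch using standard voms-proxy options (the 48-hour value is just an example):

bash$ voms-proxy-info -timeleft
bash$ voms-proxy-init -hours 48 -voms Gluex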

To see the proxy information:

bash$ voms-proxy-info -all


Now, to try to run something on the grid:

bash$ globus-job-run grendl.phys.uconn.edu /bin/hostname -f

This should return "grendl.phys.uconn.edu". Full path names to the executable must be used, as no environment variables or PATH are defined on the grid this way. The following can be used to discover the full path to the executable if it is unknown:

bash$ globus-job-run grendl.phys.uconn.edu /usr/bin/which hostname


Something more useful would be to look at the contents of a folder where I have built my software:

bash$ globus-job-run grendl.phys.uconn.edu /bin/bash -c 'ls -ltr $OSG_APP/Gluex'

If I wanted to copy a file from my client that I am working on to the grid I would use:

bash$ globus-url-copy file:////home/leverin/condor-tutorial/submit/README \
gsiftp://grendl.phys.uconn.edu/nfs/direct/app/Gluex/test/README_GRID


Copying a file from somewhere on the web would look like:

bash$ globus-url-copy http://www.jlab.org/Hall-D/datatables/hd_res_photon.root \
gsiftp://grendl.phys.uconn.edu/nfs/direct/app/Gluex/test/hd_res_photon.root

SRM

SRM (Storage Resource Manager) is a protocol for Grid access to mass storage systems. The protocol itself is a collaboration (http://sdm.lbl.gov/srm-wg/) between Lawrence Berkeley (LBNL), Fermilab (FNAL), Jefferson Lab (JLAB), CERN, and RAL. This is the management tool used for storing the large amounts of data produced by my simulation jobs.

The following will show the contents of the Gluex storage folder where my results are stored. I've saved my HDGEANT output as this is the most time intensive part of the job and will not likely need to be redone. The reconstruction/analysis is saved here so I can move it later for analysis.

bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0
bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0/hdgeant_output
bash$ srmls srm://grinch.phys.uconn.edu/Gluex/eta-pi0/analysis_output

Other handy commands are srmcp for copying files from the job directory, srmrm for removing files in storage and srmmkdir to make new folders in storage for organization. They work like normal Linux commands with srm prepended. Again, it is necessary to specify the full path to the file locations.

bash$ srmcp file:///$JOB_HOME/hdgeant_cut.hddm srm://grinch.phys.uconn.edu/Gluex/testfolder/HDGEANT_OUTFILE
bash$ srmrm srm://grinch.phys.uconn.edu/Gluex/eta-pi0/testfolder/HDGEANT_OUTFILE
bash$ srmmkdir srm://grinch.phys.uconn.edu/Gluex/testfolder

Step 5: Job Management and Condor-G

The Grid currently uses condor as the job manager. This is the tutorial I followed and you should too: Job Management with Condor. It is fairly extensive. The only differences from submitting to the GlueX Grid will be that we will use a different executable and grid-resource.

This is a test script provided to me by Richard: download condor-g0 (condor-g0.tgz).

Untar this and use the submit file there instead of the submit file they have you create in the tutorial. We just want a simple but non-trivial program to execute, and primetest does not exist on the Gluex grid.
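
For example, assuming the tarball was saved into your working directory:

bash$ tar -xzvf condor-g0.tgz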

Look at condor-g0.sub in the directory where you untar'd condor-g0.tgz

bash$ more condor-g0.sub

It should look something like:

executable=condor-g0.d/myscript.sh
arguments=TestJob 10
output=condor-g0.d/results.output
error=condor-g0.d/results.error
log=condor-g0.log
notification=never
universe=grid
grid_resource=gt2 gluskap.phys.uconn.edu/jobmanager-condor
#grid_resource=gt4 https://gluskap.phys.uconn.edu:9443 Condor
globusrsl = (condorsubmit=(requirements 'Arch == \"Intel\"'))
queue

Arguments to pass to your script can be put here. The example executable script is told to sleep for 10 seconds. The gt4 grid resource isn't functioning at the moment but gt2 works well enough for now so we use that. The cluster is a mix of 64-bit and 32-bit machines. The globusrsl command restricts the build to a 32-bit machine.


Now to submit the job and continue with the rest of the tutorial:

bash$ condor_submit condor-g0.sub

A trivial job will take a few minutes from submission to completion due to overhead in the condor process.
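
While you wait, the usual condor client commands can be used to watch or remove the job. A brief sketch (the job ID 1234.0 is a placeholder; -globus shows the grid job status on older Condor-G releases):

bash$ condor_q
bash$ condor_q -globus
bash$ condor_rm 1234.0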

Step 6: Building your executables from source

What follows is what I had to do for my executables to work on the grid, due to the dynamic linking of the HallD libraries and the other libraries needed for building and executing the software. As such, I will link the scripts used to build HDDS, the HALLD software, a plugin and my custom executables based on the HallD software. (It may be possible to submit an entirely statically built binary without having to compile source code on the cluster, but that did not seem to work for what I needed.)

Building the HALLD source code

A setup script, setup.sh, that defines the environment variables is found in $OSG_APP/Gluex/test. This is where a custom $HALLD_HOME should be defined. $HALLD_HOME is where any libraries and executables will be moved to, and it is currently set as:

HALLD_HOME=/nfs/direct/app/Gluex/test

Change this to something new if you want your code to be put there, place the new setup.sh in that folder and source it in your script submitted to condor.
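
As a rough illustration only (the real setup.sh in $OSG_APP/Gluex/test is authoritative; everything here other than HALLD_HOME is an assumption), a custom setup.sh might look something like:

# hypothetical sketch of a custom setup.sh -- copy the real one and edit HALLD_HOME
export HALLD_HOME=/nfs/direct/app/Gluex/mywork
export BMS_OSNAME=Linux_CentOS5-i686-gcc4.1.2
export PATH=$HALLD_HOME/bin/$BMS_OSNAME:$PATH
export LD_LIBRARY_PATH=$HALLD_HOME/lib/$BMS_OSNAME:$LD_LIBRARY_PATH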


HDDS

Now that HDDS is built separately from the HALLD source code, we need to build it first. The build files are here: download build_hdds.sh and build_hdds.sub. Remember to change the location of the setup.sh in build_hdds.sh. The script will download HDDS from the SVN repository, build it in the job directory, and then move it to HALLD_HOME and fix the permissions.
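
A hypothetical outline of what such a build script does, just to make the flow concrete (the SVN URL, directory layout and setup.sh path below are assumptions; use the values from the real build_hdds.sh linked above):

#!/bin/bash
# runs in the condor job directory on the grid
source /nfs/direct/app/Gluex/test/setup.sh          # path is an assumption
svn co https://halldsvn.jlab.org/repos/trunk/hdds   # URL is an assumption
(cd hdds && make)                                   # build in the job directory
cp -r hdds $HALLD_HOME/                             # move the result to HALLD_HOME
chmod -R g+rwX,o+rX $HALLD_HOME/hdds                # fix the permissions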

Then submit the build job to condor:

bash$ condor_submit build_hdds.sub

Check the log, error and output files for clues to success or failure. Look at HALLD_HOME to see if everything is where it should be and all the permissions are set properly.

bash$ more build_hdds.log
bash$ more build_hdds.error
bash$ more build_hdds.output
bash$ globus-job-run grendl.phys.uconn.edu /bin/ls -ltr /nfs/direct/app/Gluex/test


HALLD SRC

Once HDDS is built, we can now build the HallD source. I needed the bggen and hdgeant executables for background simulation.

The HALLD build files can be taken from here: build_src.sh and build_src.sub. Remember to change the location of the setup.sh in build_src.sh. The script will download the source from the SVN trunk, build it, and move it.

Then submit the build job to condor:

bash$ condor_submit build_src.sub

Results can be checked like before.


HDParSim

For my reconstruction I need the HDParSim plugin that does not build with the standard source code. The submission files are here: build_hdparsim.sh and build_hdparsim.sub

Step 7: Building custom source code

filterHighE

For doing background simulations, I needed to filter out a lot of events that wouldn't pass some simple energy and multiplicity cuts, to keep the amount of disc space and CPU time reasonable. For this I used a JANA based program called filterHighE.

The source code for filterHighE is found here: filterHighE.tgz. The submit files are here: build_filterHighE.sh and build_filterHighE.sub.

Notice that the transfer parameters in the submission file are uncommented now, as we need to upload filterHighE.tgz to the job to be unpacked and built.

hddmcp

To reduce overall disc space, I cut out the unnecessary data from the hdgeant.hddm files. I used hddmcp to do this. The source code and executable are here: cuthddm.tgz. The README file in the tarball explains how to modify and build your own. It does not need to be compiled on the grid and can just be moved to the HALLD_HOME/bin folder. The permissions must be set on the grid as they don't transfer from the Linux client.

bash$ globus-url-copy file:////home/leverin/gluex/my_src/cuthddm/hddmcp \
/nfs/direct/app/Gluex/test/bin/Linux_CentOS5-i686-gcc4.1.2
bash$ globus-job-run grendl.phys.uconn.edu /usr/bin/chmod a+x \
/nfs/direct/app/Gluex/test/bin/Linux_CentOS5-i686-gcc4.1.2/hddmcp

fcalTree4

My reconstruction code is here: fcalTree4.tgz. The submission files are here: build_fcaltree4.sh and build_fcaltree4.sub.

I am interested in eta-pi0 reconstruction. I use HDParSim to handle the protons and identify DPhoton showers due to charged particles. The output is a ROOT file.

HDParsim needs the resolution tables found here: HOWTO run the semi-parametric Monte Carlo. I moved the root files to /nfs/direct/app/Gluex/eta-pi0/lib and then linked to them for each job rather than moving them each time. This saves some time and bandwidth.
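
In the job script the linking step might look like this (a sketch; the table file names are whatever the HOWTO above produces):

# link the HDParSim resolution tables into the job's working directory
# instead of copying them for every job
ln -s /nfs/direct/app/Gluex/eta-pi0/lib/*.root .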

A Background Simulation and Analysis

For my jobs I require a longer proxy lifetime, and I submit them under the simulation role.

bash$ voms-proxy-init -hours 72 -cert ~/.globus/usercred.p12 -voms Gluex:/Gluex/simulation

I've written a Perl script that creates the job directories, the run.ffr file needed by bggen, the control.in file needed by hdgeant, and the submission files needed for condor. Each job has unique random number seeds so that it is statistically independent of the others. The Perl script uses templates and then writes out the unique job files to the job folder.

A tarball of my job submission script is found here: jobsub.tgz

There is a loop in runfullanalysis.pl that determines the job numbers to submit. Edit this to vary the job number range. The bggen input card fort.15 has a line that controls the number of events to generate. The script moves the hdgeant_cut.hddm and fcaltree4.root files to storage and then deletes all the files in the job directory to clean things up, making sure nothing large gets sent back to the client machine.
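
The same idea expressed as a bash sketch, for readers who prefer that to Perl (the template names, seed scheme and job range below are purely illustrative; the authoritative version is runfullanalysis.pl in jobsub.tgz):

#!/bin/bash
# make one directory per job, substitute a unique random seed into the
# templates, and hand each job to condor
for job in $(seq 101 150); do
    mkdir -p job_$job && cd job_$job
    seed=$((1000000 + job))
    sed "s/@SEED@/$seed/" ../templates/control.in.template > control.in
    sed "s/@JOB@/$job/"   ../templates/run.sub.template    > run.sub
    condor_submit run.sub
    cd ..
done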