HOWTO Execute a Launch using NERSC
This page gives some instructions on executing a launch at NERSC. Note that some steps must be completed to make sure things are set up at Cori and Globus prior to submitting any jobs.
The following is based on the steps used for the RunPeriod-2018-01 monitoring launch ver18 using swif2.
To run jobs at NERSC you need a user account there. The account must be associated with a repository, which is what NERSC calls a project that has resources allocated to it. At this time, the GlueX project is m3120. You can find instructions for applying for an account here: http://www.nersc.gov/users/accounts/user-accounts/get-a-nersc-account.
Setting up SSH
Swif2 will access Cori at NERSC via passwordless login. To set this up, you'll need an RSA key with an empty passphrase and the public key installed on the NERSC account to be used. Chris' instructions for this are:
(2-a) See http://www.nersc.gov/users/connecting-to-nersc/connecting-with-ssh
(2-b) As the user who owns the workflow, generate an ssh key as specified, supply no passphrase
(2-c) Log in to nim.nersc.gov, and under "My ssh keys" add the public key you generated in (2-b)
(2-d) Verify that you can log in to cori.nersc.gov without a password after logging in to ifarm as the workflow user.
For this to actually work, you must make sure that the key created above is what is used when authenticating. I created a dedicated key pair with the names ~/.ssh/id_rsa_nersc and ~/.ssh/id_rsa_nersc.pub . There are multiple ways to use this key (ssh-agent, using the ‘-i’ option with the ssh command,...). The best way to use it with swif2 though is to specify the key in the ~/.ssh/config file. There you can specify that the special key is used when logging into cori.nersc.gov and furthermore which username to use when logging in. This last part is important since I needed it to log into my davidl account from the gxproj4 account at jlab.
Here are the lines that need to be in the ~/.ssh/config file:
# The following is to allow passwordless login to
# cori.nersc.gov without having to run an agent or
# explicitly give the key on the ssh command.
# n.b. this will login as davidl and not in some
# group account at nersc (since none currently
# exists)
# 7/18/2018 DL
Host cori cori.nersc.gov
    IdentityFile ~/.ssh/id_rsa_nersc
    User davidl
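Step (2-d) above asks you to verify that the login works without a password. A minimal sketch of such a check follows, assuming the ~/.ssh/config entry is in place. The function names are hypothetical; BatchMode and ConnectTimeout are standard OpenSSH options, with BatchMode=yes making ssh fail instead of prompting for a password.

```python
import subprocess


def ssh_check_command(host="cori.nersc.gov"):
    """Build an ssh command that fails rather than prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt; fail if key authentication does not work
        "-o", "ConnectTimeout=10",  # give up quickly if the host cannot be reached
        host,
        "true",                     # trivial remote command; only the exit status matters
    ]


def passwordless_login_works(host="cori.nersc.gov"):
    """Return True if ssh to the host succeeds without any interactive prompt."""
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True, text=True)
    except FileNotFoundError:
        # The ssh client is not installed on this machine.
        return False
    return result.returncode == 0
```

Calling passwordless_login_works() from the workflow account on ifarm should return True once the key and config entry are set up. If it returns False, running the same ssh command by hand with -v added usually shows which key was offered.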
Files and directories on Cori at NERSC
When jobs are run at NERSC they will need access to a couple of files from the launch directory where we keep GlueX farm submission scripts and files. This directory is kept in our subversion repository at JLab. When a job is started at NERSC it will look for this directory in the project directory that swif2 is using for the workflow (see next section). The launch directory should already be checked out there, but for completeness, here is how you would do it:
cd /global/project/projectdirs/m3120
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/launch
If the directory is there, make sure it is up to date:
cd /global/project/projectdirs/m3120
svn update
The most important files are the script_nersc.py script and the jana_offmon_nersc.config (jana_recon_nersc.config) files. The first is what is actually run inside the container when the job wakes up. The second specifies the plugins and other settings. For the most part, the "nersc" versions of the jana config files should be kept in alignment with the JLab versions. One notable difference is that the NERSC jobs are always run on whole nodes, so NTHREADS is always set to "Ncores", whereas at JLab it is usually set to 24.
Globus Endpoint Authentication
Submitting jobs to swif2
The offsite jobs at NERSC are managed from the gxproj4 account. This is a group account with access limited to certain users. Your ssh key must be added to the account by an existing member. Contact the software group to request access.
Generally, one would log into an appropriate computer with:
The following are some steps needed to create a workflow and submit jobs.
1. Create a new workflow. The workflow name follows a convention based on the type of launch, run period, version, and optional extra qualifiers. Here is the command used to create the workflow for offline monitoring launch ver18 for RunPeriod-2018-01:
swif2 create -workflow offmon_2018-01_ver18 -max-concurrent 2000 -site nersc/cori -site-storage nersc:m3120
The -max-concurrent 2000 option tells swif2 to limit the number of dispatched jobs to no more than 2000. The primary concern here is scratch disk space at NERSC. If each input file is 20GB and produces 7GB of output, then we need 27GB * 2000 = 54TB of free scratch disk space. If multiple launches are running at the same time and using the same account's scratch disk, then it is up to you to make sure the sum of requirements does not exceed the quota. At this point in time we have a quota of 60TB of scratch space, though NERSC has said they will revisit that at the beginning of the year.
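The scratch space estimate above can be redone for other file sizes or quotas with a few lines of Python. This is only a sketch: the function names are hypothetical, and it uses 1TB = 1000GB to match the arithmetic in the text.

```python
def scratch_needed_tb(input_gb, output_gb, max_concurrent):
    """Scratch space in TB needed if every dispatched job holds its input and output at once."""
    return (input_gb + output_gb) * max_concurrent / 1000.0


def max_jobs_for_quota(input_gb, output_gb, quota_tb):
    """Largest number of concurrent jobs whose files fit within the scratch quota."""
    return int(quota_tb * 1000 // (input_gb + output_gb))


# Numbers from the text: 20GB input, 7GB output, 2000 concurrent jobs, 60TB quota.
print(scratch_needed_tb(20, 7, 2000))   # 54.0
print(max_jobs_for_quota(20, 7, 60))    # 2222
```

When several launches share the same scratch disk, the sum of scratch_needed_tb over all of them is what has to stay under the quota.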
The -site nersc/cori is required at the moment and is the only allowed option for "site".
The -site-storage nersc:m3120 option specifies which NERSC project's assigned disk space to use. At this point, swif2 has been changed to use scratch disk space assigned to the personal account being used to run the jobs, so I believe this option is being ignored.
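For scripted launches, the create command can be assembled from the launch parameters. The sketch below is not part of swif2 or the launch scripts. The helper names are hypothetical, and the name pattern (type, run period without the "RunPeriod-" prefix, then "ver" plus the version, with any extra qualifier appended after an underscore) is inferred from the single example above. The swif2 options are the ones shown in the command above.

```python
import shlex


def workflow_name(launch_type, run_period, version, qualifier=None):
    """Build a workflow name such as offmon_2018-01_ver18 (pattern inferred from one example)."""
    period = run_period.replace("RunPeriod-", "")
    name = f"{launch_type}_{period}_ver{version}"
    if qualifier:
        # How extra qualifiers are attached is an assumption.
        name += f"_{qualifier}"
    return name


def swif2_create_command(workflow, max_concurrent=2000,
                         site="nersc/cori", site_storage="nersc:m3120"):
    """Return the swif2 create command as an argument list."""
    return [
        "swif2", "create",
        "-workflow", workflow,
        "-max-concurrent", str(max_concurrent),
        "-site", site,
        "-site-storage", site_storage,
    ]


name = workflow_name("offmon", "RunPeriod-2018-01", 18)
print(shlex.join(swif2_create_command(name)))
```

Printing the command before running it makes it easy to compare against the convention by eye. The argument list can be handed to subprocess.run once it looks right.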
2. Create a working directory in the gxproj4 account and check out the launch scripts. This is done so that the scripts can be modified for the specific launch in case some tweaks are needed. Changes should eventually be pushed back into the repository, but having a dedicated directory for each launch helps keep its changes separate from those of other launches.
mkdir ~gxproj4/NERSC/2018.10.05.offmon_ver18
cd ~gxproj4/NERSC/2018.10.05.offmon_ver18
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/launch
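The working directory name in this example appears to combine the date with the launch type and version. A small sketch for generating such a name is given below. The helper name is hypothetical and the pattern is an assumption based on this one example, not a documented convention.

```python
from datetime import date


def launch_dir_name(day, launch_type, version):
    """Build a directory name such as 2018.10.05.offmon_ver18 (pattern assumed from one example)."""
    return f"{day.strftime('%Y.%m.%d')}.{launch_type}_ver{version}"


print(launch_dir_name(date(2018, 10, 5), "offmon", 18))  # 2018.10.05.offmon_ver18
```

Passing date.today() instead of a fixed date gives the name for a launch being set up now.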
3. Edit the file launch/launch_nersc.py to adjust all of the settings at the top to be consistent with the current launch. Make sure TESTMODE is set to "True" so the script can be tested without actually submitting any jobs.