Difference between revisions of "HOWTO Execute a Launch using NERSC"
Revision as of 13:37, 17 October 2018
- 1 Introduction
- 2 NERSC Account
- 3 Setting up SSH
- 4 Files and directories on Cori at NERSC
- 5 Globus Endpoint Authentication
- 6 Submitting jobs to swif2
This page gives some instructions on executing a launch at NERSC. Note that some steps must be completed to make sure things are set up at Cori and Globus prior to submitting any jobs.
The following is based on steps used to do RunPeriod-2018-01 monitoring launch ver 18 using swif2.
To run jobs at NERSC you need a user account there. The account must be associated with a "repository", which is what NERSC calls a project that has resources allocated to it. At this time, the GlueX project is m3120. You can find instructions for applying for an account here: http://www.nersc.gov/users/accounts/user-accounts/get-a-nersc-account.
Setting up SSH
Swif2 will access Cori at NERSC via passwordless login. To set this up, you'll need an RSA key with an empty passphrase and the public key installed on the NERSC account to be used. Chris' instructions for this are:
(2-a) See http://www.nersc.gov/users/connecting-to-nersc/connecting-with-ssh
(2-b) As the user who owns the workflow, generate an ssh key as specified, supply no passphrase
(2-c) log in to nim.nersc.gov, and under "My ssh keys" add the public key you generated in (2-b)
(2-d) verify that you can login to cori.nersc.gov without a password after logging in to ifarm as the workflow user.
For this to actually work, you must make sure that the key created above is what is used when authenticating. I created a dedicated key pair with the names ~/.ssh/id_rsa_nersc and ~/.ssh/id_rsa_nersc.pub . There are multiple ways to use this key (ssh-agent, using the ‘-i’ option with the ssh command,...). The best way to use it with swif2 though is to specify the key in the ~/.ssh/config file. There you can specify that the special key is used when logging into cori.nersc.gov and furthermore which username to use when logging in. This last part is important since I needed it to log into my davidl account from the gxproj4 account at jlab.
Here are the lines that need to be in the ~/.ssh/config file:
# The following is to allow passwordless login to
# cori.nersc.gov without having to run an agent or
# explicitly give the key on the ssh command.
# n.b. this will login as davidl and not in some
# group account at nersc (since none currently
# exists)
# 7/18/2018 DL
Host cori cori.nersc.gov
    IdentityFile ~/.ssh/id_rsa_nersc
    User davidl
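Step (2-d) can be checked non-interactively. The following is a minimal sketch, assuming the ~/.ssh/config entry above is in place; the function names are hypothetical and are not part of swif2 or the launch scripts. It relies on the standard OpenSSH option BatchMode=yes, which makes ssh fail rather than prompt for a password.

```python
import subprocess


def build_ssh_check_command(host="cori.nersc.gov"):
    """Return an ssh command line that fails instead of prompting for a password."""
    # BatchMode=yes disables password and passphrase prompts, so the command
    # can only succeed if key-based (passwordless) login works.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def can_login_without_password(host="cori.nersc.gov"):
    """Run the check and return True if ssh exited with status 0."""
    result = subprocess.run(
        build_ssh_check_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling can_login_without_password() from the workflow account on ifarm should return True once the key and config are set up; a False result means ssh would have prompted for a password or could not connect.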
Files and directories on Cori at NERSC
When jobs are run at NERSC they will need access to a couple of files from the launch directory, where we keep the GlueX farm submission scripts and files. This directory is kept in our subversion repository at JLab. When a job is started at NERSC it will look for this directory in the project directory that swif2 is using for the workflow (see next section). The launch directory should already be checked out there but, for completeness, here is how you would do it:
cd /global/project/projectdirs/m3120
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/launch
If the directory is there, make sure it is up to date:
cd /global/project/projectdirs/m3120
svn update
The most important files are the script_nersc.sh script and the jana_offmon_nersc.config (jana_recon_nersc.config) files. The first is what is actually run inside of the container when the job wakes up. The second specifies the plugins and other settings. For the most part, the "nersc" versions of the jana config files should be kept in alignment with the JLab versions. One notable difference is that the NERSC jobs are always run on whole nodes, so NTHREADS is always set to "Ncores", whereas at JLab it is usually set to 24.
*** IMPORTANT *** At this point you should check the available scratch disk space for the account you will use to run the launch using the myquota command on Cori. If sufficient space is not available (i.e. 27GB x MAX_CONCURRENT_JOBS) then clear it out now.
Globus Endpoint Authentication
Submitting jobs to swif2
The offsite jobs at NERSC are managed from the gxproj4 account. This is a group account with access limited to certain users. Your ssh key must be added to the account by an existing member. Contact the software group to request access.
Generally, one would log into an appropriate computer with:
The following are some steps needed to create a workflow and submit jobs.
Create a new workflow
This step can actually be skipped since the script in the next step will automatically create the workflow with the correct name and parameters if it does not already exist. These are instructions in case you want/need to create the workflow yourself.
The workflow name follows a convention based on the type of launch, run period, version, and optional extra qualifiers. Here is the command used to create the workflow for offline monitoring launch ver18 for RunPeriod-2018-01:
swif2 create -workflow offmon_2018-01_ver18 -max-concurrent 2000 -site nersc/cori -site-storage nersc:m3120
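The workflow name used in the command above can be assembled from the launch type, run period, and version. The following is only a sketch of that convention: the helper name, the zero-padding of the version to two digits, and the placement of the optional qualifier at the end are assumptions, not taken from launch_nersc.py.

```python
def workflow_name(launchtype, runperiod, ver, extra=""):
    """Build a workflow name such as offmon_2018-01_ver18."""
    # Accept either "RunPeriod-2018-01" or plain "2018-01".
    period = runperiod.replace("RunPeriod-", "")
    name = "%s_%s_ver%02d" % (launchtype, period, ver)
    if extra:
        # Assumed position for optional extra qualifiers.
        name += "_" + extra
    return name


print(workflow_name("offmon", "RunPeriod-2018-01", 18))  # offmon_2018-01_ver18
```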
The -max-concurrent 2000 option tells swif2 to limit the number of dispatched jobs to no more than 2000. The primary concern here is scratch disk space at NERSC. If each input file is 20GB and produces 7GB of output then we need 27GB * 2000 = 54TB of free scratch disk space. If multiple launches are running at the same time and using the same account's scratch disk then it is up to you to make sure the sum of requirements does not exceed the quota. At this point in time we have a quota of 60TB of scratch space, though NERSC has said they will revisit that at the beginning of the year.
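The scratch space arithmetic above can be written out as a short calculation. This is a sketch using the numbers quoted in the text (20GB input, 7GB output, 2000 concurrent jobs, 60TB quota) and taking 1TB = 1000GB; the function names are hypothetical.

```python
def scratch_needed_tb(input_gb, output_gb, max_concurrent):
    """Scratch space in TB needed when max_concurrent jobs are in flight."""
    return (input_gb + output_gb) * max_concurrent / 1000.0


def max_jobs_for_quota(quota_tb, input_gb, output_gb):
    """Largest number of concurrent jobs that fits inside the scratch quota."""
    return int(quota_tb * 1000 // (input_gb + output_gb))


needed = scratch_needed_tb(20, 7, 2000)
print(needed)                         # 54.0
print(needed <= 60)                   # True: fits in the 60TB quota
print(max_jobs_for_quota(60, 20, 7))  # 2222
```

If two launches share the same scratch area, the values returned by scratch_needed_tb for each should be summed before comparing against the quota.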
The -site nersc/cori is required at the moment and is the only allowed option for "site".
The -site-storage nersc:m3120 is used to specify which NERSC project assigned disk space to use. At this point, swif2 has been changed to use scratch disk space assigned to the personal account being used to run the jobs so I believe this is being ignored.
Prepare working directory at JLab
Create a working directory in the gxproj4 account and check out the launch scripts. This is done so that the scripts can be modified for the specific launch in case some tweaks are needed. Changes should eventually be pushed back into the repository, but having a dedicated directory for the launch can help with managing things, especially when there are multiple launches.
mkdir ~gxproj4/NERSC/2018.10.05.offmon_ver18
cd ~gxproj4/NERSC/2018.10.05.offmon_ver18
svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/launch
Configure the parameters for the launch
At this time the parameters used for a NERSC launch are specified in the launch_nersc.py script itself. This differs slightly from jobs run at JLab, which use the launch.py script that reads the configuration from a separate file. At some point the NERSC system should be brought more into alignment with that, but for now, this is how it is.
Edit the file launch/launch_nersc.py to adjust all of the settings at the top to be consistent with the current launch. All of the parameters are at the top of the file in a well marked section. Here is an explanation of the parameters:
|TESTMODE||Set this to "True" so the script can be tested without actually submitting any jobs. When finally ready to actually submit jobs to swif2, set it to "False"|
|VERBOSE||Default is 1. Set to zero for minimal messages or 3 for all messages|
|LAUNCHTYPE||either "offmon" or "recon"|
|VER||Version of this particular type of launch|
|WORKFLOW||This will be set automatically based on other values. Only change this if the default name is not appropriate|
|NAME||Similar to above, this is automatically set. It is used to set the job names.|
|RCDB_QUERY||If specific runs are not set in RUNS (see below) then this is used to query the RCDB for runs in the specified range that should be processed.|
|RUNS||Normally this is set to an empty array and the list of runs obtained from the RCDB. This can be set to a specific run list and only those will be processed.|
|MINRUN||If RUNS is an empty set, this is used along with MAXRUN and RCDB_QUERY to extract the list of runs to process from the RCDB. Note that this should be in a range consistent with RUNPERIOD. No check is made in this script that ensures this otherwise.|
|MAXRUN||See MINRUN above.|
|MINFILENO||Minimum file number to process for each run. Normally this is set to 0. See MAXFILENO below for more details.|
|MAXFILENO||Maximum file number to process for each run. If doing a monitoring launch then this would normally be set to 4 so that files 000-004 are processed. Set this to a large number like 10000 to process all files in each run. The RCDB will be queried for each run so that only files that actually exist in the specified range are submitted as jobs.|
|MAX_CONCURRENT_JOBS||This is a limit set on the swif2 workflow for how many jobs can be in-flight at once. This can only be specified when the workflow is created. If the workflow was created outside of this script then this will do nothing.|
|PROJECT||The NERSC project. Use 'm3120' for the GlueX allocation|
|TIMELIMIT||Maximum time job may run. If a job runs longer than this it will be killed. Keep in mind that jobs run on KNL take about 2.4 times as long to run as on Haswell. (See NODETYPE below.)|
|QOS||Should be 'debug', 'regular', or 'premium'. Usually you will just want 'regular'|
|NODETYPE||'haswell' or 'knl'. Jobs will take about 2.4 times longer to run on knl as haswell and the charge rate is 20% more per node hour. However, there are only 2k haswell nodes and 9k knl nodes and much more demand for haswell. Note that the setting of TIMELIMIT should be adjusted based on this.|
|IMAGE||Shifter image. Actually, this is the same name as the Docker image used to create the Shifter image. Note that this image will need to already exist in Shifter since it will not be pulled in automatically. The image used up to now is 'docker:markito3/gluex_docker_devel'|
|RECONVERSION||This is the version of the reconstruction code that should be used. The executables will be read from CVMFS from a directory that mirrors the directory /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr. This value should be a directory relative to that such as: halld_recon/halld_recon-recon-ver03.2|
|SCRIPTFILE||The script to run inside the container for the job. The container will mount the launch directory as /launch in the container so this should normally be set to '/launch/script_nersc.sh'|
|CONFIG||The jana config file to use. This is set automatically but can be overridden if really needed. As with the SCRIPTFILE, this should access the file via the launch directory mounted in the container as /launch|
|OUTPUTTOP||Location on the JLab queue to copy files back to. Usually, this will start with mss:/mss/halld/ indicating a place on the tape system. This specifies the top-level directory and the launch_nersc.py script will specify the subdirectories where individual output files should be copied.|
|RCDB_HOST||Where the launch_nersc.py script should access the RCDB to extract the run numbers to be used for this launch.|
|RCDB_USER||User to connect to the RCDB as. (See RCDB_HOST above.)|
|RCDB||This is used internally in the script and should always be set to None|
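To illustrate how MINFILENO, MAXFILENO, NODETYPE, and TIMELIMIT interact, here is a minimal sketch. It is not the code in launch_nersc.py: the function names are hypothetical, the list of existing file numbers stands in for the RCDB query, and the job name pattern is inferred from names such as GLUEX_offmon_041261_001 that appear in the swif2 commands on this page. The 2.4 factor is the approximate KNL slowdown quoted in the table.

```python
KNL_SLOWDOWN = 2.4  # approximate factor from the table above


def job_names(launchtype, run, existing_files, minfileno=0, maxfileno=4):
    """Job names for the files of one run that fall in [minfileno, maxfileno]."""
    # Only files that actually exist for the run are turned into jobs.
    selected = sorted(f for f in existing_files if minfileno <= f <= maxfileno)
    return ["GLUEX_%s_%06d_%03d" % (launchtype, run, f) for f in selected]


def scaled_timelimit_hours(haswell_hours, nodetype):
    """Scale a time limit tuned for haswell when running on knl."""
    if nodetype == "knl":
        return haswell_hours * KNL_SLOWDOWN
    return haswell_hours


# A run with 7 files, monitoring launch settings (files 000-004 only).
names = job_names("offmon", 41261, [0, 1, 2, 3, 4, 5, 6])
print(len(names), names[1])  # 5 GLUEX_offmon_041261_001
```

With MAXFILENO set to a large value such as 10000, the same function would return one job name per existing file in the run.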
Archiving Launch Parameters
Edit the appropriate html file in the run period's /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ subdirectory (e.g. /group/halld/www/halldweb/html/data_monitoring/launch_analysis/2018_01/2018_01.html)
There may be two places to add the current launch
There are many places and ways that jobs can fail and it can be difficult to find information since it is dispersed over several systems. Here are some tips for tracking down issues.
SWIF2 is the starting and ending point for each job so it is an important first step. Unfortunately, with thousands of jobs, it is not usually practical to dump information to the screen and scan it for the one you're interested in.
Finding the swif2 jobid: swif2 show-job -workflow offmon_2018-01_ver18 -name GLUEX_offmon_041261_001
Listing problem jobs: swif2 status -problems -workflow offmon_2018-01_ver18
Setting Time Limit: swif2 modify-jobs -workflow offmon_2018-01_ver18 -time set 10h -names GLUEX_offmon_040902_004 n.b. The normal job submission sets a time limit
Retrying a job: swif2 retry-jobs -workflow offmon_2018-01_ver18 11926
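When tracking down many jobs, it can help to generate these command lines from a script. The sketch below only assembles the argument lists for the swif2 commands shown above, using exactly the options and ordering that appear on this page. The function names are hypothetical, and no assumption is made about the format of swif2's output.

```python
def show_job_cmd(workflow, job_name):
    """Command to look up a job (and its swif2 job id) by name."""
    return ["swif2", "show-job", "-workflow", workflow, "-name", job_name]


def problems_cmd(workflow):
    """Command to list the problem jobs in a workflow."""
    return ["swif2", "status", "-problems", "-workflow", workflow]


def retry_cmd(workflow, job_id):
    """Command to retry a job given its swif2 job id."""
    return ["swif2", "retry-jobs", "-workflow", workflow, str(job_id)]


wf = "offmon_2018-01_ver18"
print(" ".join(show_job_cmd(wf, "GLUEX_offmon_041261_001")))
print(" ".join(problems_cmd(wf)))
print(" ".join(retry_cmd(wf, 11926)))
```

Each list can be passed to subprocess.run on a machine where swif2 is available.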