Hall D Job Management System

  • This page gives instructions for creating and launching a set of jobs using the Hall-D Job Management System developed by Mark Ito. The instructions are generic: this system is used for the weekly monitoring jobs, but it can be used for other sets of job launches as well.

Database Table Overview

  • Job management database table (<project_name>): For each input file, keeps track of whether or not a job for it has been submitted, along with other optional fields.
  • Job status database table (<project_name>Job (no space)): For each job, keeps track of the job-id, the job status, memory used, cpu & wall time, the time taken to complete various stages (e.g. pending, dependency, active), and other fields.
  • Job metrics database table (<project_name>_aux (no space)): For each job, keeps track of the job-id, how many events were processed, the time it took to copy the cache file, and the time it took to run the plugin. This information is culled from the log files of each job; this step is done within the analysis directory of each launch.
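
For reference, the exact column layout of each of these tables can be listed with the same mysql client used later on this page (a minimal sketch; <project_name> is a placeholder for the actual project name):

mysql -hhallddb -ufarmer farming -e "show columns from <project_name>"      # job management table
mysql -hhallddb -ufarmer farming -e "show columns from <project_name>Job"   # job status table
mysql -hhallddb -ufarmer farming -e "show columns from <project_name>_aux"  # job metrics table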

Initialize Project Management

  • Log into the ifarm machine with one of the gxproj accounts. For this example we will use gxproj1.
ssh gxproj1@ifarm -Y
  • Go to a directory to do the launch. In principle, any directory will work, but for gxproj1 this is usually done in /home/gxproj1/halld/
  • Check out the necessary scripts
    svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj
    This will get all necessary scripts for launching. Once checked out,
    cd projects
  • The script create_project.sh can be used to create a new project. It takes a single argument, the project name. It is assumed that the project name is of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata, where this string will be parsed to give the run period 20YY_MM and version number VV. One thing to do BEFORE creation of a new project is to edit the conditions of the launch (plugins to run over, memory requested, disk space requested) within templates/template.jsub . This information is saved automatically at project creation time into files in /group/halld/data_monitoring/run_conditions .

To create a project do
./create_project.sh offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata

The name has been chosen to be as consistent as possible with other directory structures. However, mysql requires that "-" be escaped in table names, so unfortunately run periods are given as 20YY_MM instead of 20YY-MM. For example, the project for Run Period 2014-10, version 17 would be named offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata.

  • A new directory with the project name (offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata) will be created, and files will be copied from the template directory and modified to reflect the run period, the user, the directory that it was created in, the project name, etc.
  • For each project, cd into the new directory
  • The script clear.sh will remove any existing tables for the current project name, then recreate them. Do
./clear.sh
  • To use the jproj.pl script that was checked in, add the directory to your path with
source ../../scripts/setup.csh

or always specify the full path

../../scripts/jproj.pl
  • Now update the table of runs with
jproj.pl <project name> update

This will fill the table with all files within /mss that are of the same form as the pattern in <project name>.jproj . If you want to register only a subset of all such files, you can edit this file directly.

  • Once you have registered all of the files you would like to run over, do
jproj.pl <project name> submit [max # of jobs] [run number]

where the additional options specify how many jobs to submit and which run number to run over. Without these options, all files that are registered and have not yet been submitted will be submitted.

At this stage you are ready to submit all files. It is a good idea to submit a few test jobs first to check that all scripts are working and that the plugins do not crash (a short example follows below). Once you are sure of this, you can send all jobs in. The remaining steps are then the monitoring of the launch, which will (among other things) put the results on the online webpage for the collaboration to view, and the analysis of the launch.
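
As a hypothetical illustration (the 2014-10 ver17 project name and the job count below are placeholders, not a prescription), a small test submission followed by a full submission might look like:

jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata submit 5   # submit at most 5 jobs as a test
./status.sh                                                              # check that the test jobs complete cleanly
jproj.pl offline_monitoring_RunPeriod2014_10_ver17_hd_rawdata submit     # submit all remaining registered files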

Project File Overview

An overview of each project file:

  • clear.sh: For the current project, deletes the job status and management database tables (if any), and creates new, empty ones.
  • <project_name>.jproj: Contains the path and file name format for the input files for the jobs.
  • <project_name>.jsub: The xml job submission script. The run number and file number variables are set during job submission for each input file.
  • script.sh: The script that is executed during the job (a rough sketch is shown after this list). If output job directories are not pre-created manually, they should be created in this script with the proper permissions:
mkdir -p -m 775 my_directory
  • setup_jlab-[run period].csh: The environment that is sourced at the beginning of the job execution.
  • status.sh: Updates the job status database table, and prints some of its columns to screen.
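
For orientation, here is a rough sketch of the shape of script.sh. It is only a sketch: the checked-out script is the authority, and the paths and file names below are illustrative placeholders.

# Sketch only -- the real script.sh from the repository is what runs in the job.
source setup_jlab-[run period].csh                                          # set up the job environment
mkdir -p -m 775 /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV  # create output directories with the proper permissions
# ... copy the cached input file, run the plugins over it, and write the log
#     files from which the <project_name>_aux table is later filled ...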

Project Management

  • Delete (if any) and create the database table(s) for the current set of job submissions:
./clear.sh

Also, if test jobs were submitted beforehand, it is best to delete their output directory and the configuration files:

rm -frv /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/verVV \
        /group/halld/data_monitoring/run_conditions/soft_comm_20YY_MM_verVV.xml \
        /group/halld/data_monitoring/run_conditions/jana_rawdata_comm_20YY_MM_verVV.conf
  • Search for input files matching the string in the .jproj file, and create a row for each in the job management database table (called <project_name>). You can test by adding an optional argument at the end, which only selects files with a specific file number:
jproj.pl <project_name> update <optional_file_number>
  • Confirm that the job management database table is accurate by printing its contents to screen:
mysql -hhallddb -ufarmer farming -e "select * from <project_name>"
  • ONLY if a mistake was made: delete the tables from the database and recreate new, empty ones by running:
./clear.sh
  • To look at the status of the submitted jobs, first query auger and update the job status database:
fill_in_job_details.pl <project_name>
  • The job status can then be viewed by submitting a query to the job status database (called <project_name>Job (no space in between)):
mysql -hhallddb -ufarmer farming -e "select id,run,file,jobId,hostname,status,timeSubmitted,timeActive,walltime,cput,timeComplete,result,error from <project_name>Job"
  • These last two commands can instead be executed simultaneously by running:
./status.sh
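
When debugging a launch, it can be useful to restrict this query to jobs that did not finish cleanly. A variation along these lines (the 'SUCCESS' string is an assumption about how auger reports a successful result; check the values actually stored in the result column first):

mysql -hhallddb -ufarmer farming -e "select run,file,jobId,status,result,error from <project_name>Job where result != 'SUCCESS'"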

Handy mysql Instructions

  • Some commonly used commands:
mysql -hhallddb -ufarmer farming # Enter the "farming" mysql database on "hallddb" as user "farmer"
quit; # Exit mysql
show tables; # Show a list of the tables in the current database
show columns from <project_name>; # Show all of the columns for the given table
select * from <project_name>; # Show the contents of all rows from the given table
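
One more query along the same lines, run from inside the mysql client, counts how many jobs are in each status (standard SQL, using the status column of the job status table shown above):

select status,count(*) from <project_name>Job group by status; # Count the jobs in each status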

Backing Up Offline Monitoring Tables

Tables created for offline monitoring can be backed up using the script backup_tables.sh, which can be checked out along with the other files from https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/jproj/projects

The script uses the mysqldump command to write out a file that can be executed to recreate the tables. Since executing this output file will drop the tables if they already exist, caution is advised. Example usage, backing up all three tables created for run period 2014_10, version 17:

backup_tables.sh 2014_10 17
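
To restore from such a backup, the dump file can be fed back to the same database. The file name below is only a placeholder (backup_tables.sh chooses its own output name), and remember that replaying the dump drops and recreates the existing tables:

mysql -hhallddb -ufarmer farming < my_backup_of_2014_10_ver17.sql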