Torque: Multiple jobs
File: torquemulti. This update: 20091212.
This note is about multiple similar jobs submitted to
the Torque batch system.
The qsub'd script file is unique in that its contents are copied as
part of the job, and there is no harm in modifying it or deleting it
before the job runs. However, files that the job might use are not
treated that way: Torque does not scan your script to see what files it
might use and whether it might be useful to take a copy of them! And it
certainly can't tell what files a compiled program might use, since
their names might be embedded in the program's source.
So it's up to you to make sure that the files that a job needs exist
and have the right contents for that job at the time the job runs. This
is of course
normally achieved by creating them before the qsub and not touching
them until that job has finished.
If you are going to
submit and/or run multiple similar (but slightly different) jobs at the
same time, then you have to be careful how they are submitted. For
example, if the jobs internally use a data input file whose contents
are to be
different for each job, then you have to find a sensible way of
achieving that; if you simply modified the data input file and
submitted a job, and then repeated that exercise, all the jobs would
probably run with the same (final) version of the data!
There are various techniques that people use for this issue; when the
same thing is done many times, it's usually worth writing a script
which does the repetitive work for you:
- If the datafile is short enough, and contains text and not
binary data, then include the data in the job script to be submitted,
as a here-document (see man bash, or Google). The job script might be
created dynamically for multiple jobs (sketch 1 after this list).
- Create a newly named datafile for each job,
and customise the job script so that it
contains that datafile name, and then qsub that script. The script
might be a dynamically created copy for multiple jobs (sketch 2 below).
- As before, create a newly named datafile for each job,
assign its name to an exported variable and pass that variable through
to the job script with the qsub -v option (sketch 3 below). The job
script can then be the same for every job. You can assign the value to
the variable before the qsub command or as part of the qsub -v option
(see man qsub).
- As before, create a newly named datafile for each job, and
create a short wrapper job script which just invokes the original job
script with the datafile name as its argument, and then qsub the
wrapper job script (sketch 4 below). The original job script uses $1
as the name of the data file to process, and can be the same for every
job. The wrapper job script would be the place to specify any qsub
options which weren't on the qsub command line.
- Create all the datafiles, with unique names, and for
each datafile, use its name (or the variable part of it) as the name of
each job, passed through using the qsub -N option. The job script can
then pick up which data file to use from the environment variable
$PBS_JOBNAME (sketch 5 below). The standard output/error file names
will be based on that name too. If you follow that convention for
output data files as well, then you will have no problem of name
clashes even if all files are stored within the same directory.
- Create a new directory for the job, with a unique name, and
copy any files which need to be customised per job to it. Do the
customisation. Then do the qsub with that directory as the current
directory, so it will be passed through as $PBS_O_WORKDIR in the usual
way (sketch 6 below). Output files could be written to this directory
too, and you wouldn't have to worry about filenames being the same,
because they're in a unique directory.
- Submit the same identical job script any required number of
times. This job script has to be cleverer than in the one-off case, as
it needs to customise a data file at run-time, using criteria possibly
from some steering file (sketch 7 below). You still have to use a
unique name for the data file (unless you put it in /tmp and are using
all the processors on each node) and you still have to be careful about
naming output files. If you are dynamically reading and updating a
steering file in order to decide what this particular job should do,
you would need to use file-locking around reading and updating this
file, because other jobs would be accessing it too. See the simple
example in man lockfile.
- Similar to the last one, but here create all the possible
datafiles with unique names first, in a chosen directory. Then qsub the
same number of jobs as there are datafiles. Each job chooses the
next-available datafile from that directory, with suitable locking
techniques surrounding the code which makes that choice (sketch 8
below).
- Again similar, but submit multiple identical jobs at once
as Torque job arrays,
by using qsub with the -t parameter, and in each job
use the environment variable $PBS_ARRAYID to choose what data file
the job will read or what action the job will take (sketch 9 below).
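Below are minimal sketches of each technique, in bash. In all of them,
myprog, template.sh, jobscript.sh, realjob.sh and the various file and
parameter names are hypothetical stand-ins for your own; adapt to
taste.

Sketch 1: the submitting loop writes each job script on the fly, with
the data embedded as a here-document, and pipes it straight to qsub:

for p in 1.0 2.0 3.0; do
    qsub <<ENDJOB
# job script: \$p was already filled in by the submitting shell
cd \$PBS_O_WORKDIR
myprog <<ENDDATA
param = $p
ENDDATA
ENDJOB
done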
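Sketch 2: a per-job datafile plus a per-job copy of a template job
script, where template.sh contains the placeholder @DATAFILE@ on the
line which names the input (e.g. myprog < @DATAFILE@):

for i in 1 2 3; do
    data=input$i.dat
    echo "param = $i" > $data                 # this job's datafile
    sed "s/@DATAFILE@/$data/" template.sh > job$i.sh
    qsub job$i.sh
done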
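Sketch 3: one fixed job script, with the datafile name passed through
the environment by qsub -v:

for i in 1 2 3; do
    data=input$i.dat
    echo "param = $i" > $data
    qsub -v DATAFILE=$data jobscript.sh
done

# jobscript.sh, identical for every job:
#   cd $PBS_O_WORKDIR
#   myprog < $DATAFILE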
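Sketch 4: a short wrapper job script per job, which invokes the
unchanging realjob.sh with the datafile name as its argument; any
extra qsub options (walltime here, purely for illustration) go in the
wrapper:

for i in 1 2 3; do
    data=input$i.dat
    echo "param = $i" > $data
    cat > wrapper$i.sh <<ENDWRAP
#PBS -l walltime=1:00:00
cd \$PBS_O_WORKDIR
./realjob.sh $data
ENDWRAP
    qsub wrapper$i.sh
done

# realjob.sh, identical for every job, reads its input from "$1":
#   myprog < "$1"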
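Sketch 5: the job name, given with qsub -N, doubles as the datafile
key, so the job script can derive its input from $PBS_JOBNAME:

for i in 1 2 3; do
    echo "param = $i" > run$i.dat
    qsub -N run$i jobscript.sh
done

# jobscript.sh, identical for every job:
#   cd $PBS_O_WORKDIR
#   myprog < $PBS_JOBNAME.dat > $PBS_JOBNAME.out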
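Sketch 6: one directory per job, with the qsub done from inside it so
that the job finds it as $PBS_O_WORKDIR:

for i in 1 2 3; do
    mkdir job$i
    echo "param = $i" > job$i/input.dat       # per-job customisation
    ( cd job$i && qsub ../jobscript.sh )
done

# jobscript.sh, identical for every job:
#   cd $PBS_O_WORKDIR
#   myprog < input.dat > output.dat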
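Sketch 7: the run-time part of the identical-script technique, using
lockfile (from the procmail package, as in man lockfile) to serialise
access to a shared steering file, here with one pending line per job:

cd $PBS_O_WORKDIR
lockfile steering.lock                 # blocks until the lock is ours
param=$(head -n 1 steering.txt)        # claim the first pending line
sed -i 1d steering.txt                 # and remove it for the others
rm -f steering.lock                    # release the lock
echo "param = $param" > input.$PBS_JOBID
myprog < input.$PBS_JOBID > output.$PBS_JOBID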
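Sketch 8: each job claims the next-available datafile by renaming it
while holding the lock, so no two jobs can pick the same file:

cd $PBS_O_WORKDIR
lockfile choose.lock
data=$(ls input*.dat 2>/dev/null | head -n 1)
[ -n "$data" ] && mv $data $data.claimed   # take it out of the pool
rm -f choose.lock
[ -z "$data" ] && exit 0                   # nothing left to do
myprog < $data.claimed > $data.out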
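Sketch 9: a job array, submitted once with qsub -t 1-10 jobscript.sh,
where each of the ten sub-jobs picks its own datafile:

# jobscript.sh, identical for every sub-job:
cd $PBS_O_WORKDIR
myprog < input$PBS_ARRAYID.dat > output$PBS_ARRAYID.out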
Any further suggestions welcome!
L.S.Lowe