Batch Farm Information

Specifications

The batch system available in the PP group runs HTCondor and consists of 14 nodes, each with a minimum of 32 cores and 1.5 GB of memory per core; in total there are a maximum of 480 cores available. A fair-share system is in place to prevent any single user from monopolising the cluster. To get an overview of the whole cluster and see what is currently running, use condor_status.

Submitting and Managing Jobs

To submit and manage jobs you can use the Ganga job management tool or use the HTCondor commands directly. For the latter, there are two principal ways of submitting: using a submit description file, or submitting a script directly. The simplest option is to submit a script using:

condor_qsub [script_name]

This will submit the job, which will wait in the queue until it is matched to a worker node and then run there. Note that you must give an actual executable script in the above, otherwise you'll get an error and the job will go on hold. If you want more control over the submission, including submitting batches of jobs, you should use a submit description file. A basic version looks like this:

Executable = [script_name]
Universe = vanilla
output = [stdout file]
error = [stderr file]
log = [condor log file]
# request_memory is in MB, request_disk in KB
request_memory = 500
request_disk = 3000000
environment="TESTVAR=mytestvar"
queue
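To submit a batch of jobs from a single file, give queue a count. HTCondor then substitutes $(Process) (running 0, 1, ..., N-1) into each job, which you can use to keep the output files and arguments of each job separate. A sketch, reusing the hypothetical script placeholder from above:

```
Executable = [script_name]
Universe   = vanilla
output     = job_$(Process).out
error      = job_$(Process).err
log        = batch.log
arguments  = $(Process)
request_memory = 500
queue 10
```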

There are many more options available in the submit description file format, so please see the HTCondor documentation for more information.
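In either submission mode, the script you point at must be a real executable: it needs a shebang line and execute permission (chmod +x). A minimal sketch, using a hypothetical name run_job.sh:

```shell
#!/bin/bash
# run_job.sh -- minimal batch job: report where it ran, do the work, flag success.
echo "Job started on $(hostname) at $(date)"
STATUS="done"   # replace with your real workload
echo "Job finished: $STATUS"
```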

After you've submitted your job, you can view its status, along with that of any other jobs you've submitted, using condor_q. If you want more information about a particular job, use condor_q -better-analyze [jobID].

Troubleshooting

If your jobs refuse to run or immediately go on hold, there are a few ways to find out what the problem is:

  • Use condor_q -better-analyze [jobID] and see if that gives a reason for the failure
  • Look at the condor log file produced. This records each step of your job's trip through the batch system and should give a reason for any failure
  • If your job has run but then failed, you can check the stderr file as that may also show problems
  • If your jobs are never going to run for whatever reason, please delete them using condor_rm [jobID]
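Each event in the condor log file starts with a three-digit event code (000 = submitted, 001 = executing, 005 = terminated, 012 = held), so grepping for hold events is a quick first check. The log lines below are illustrative stand-ins, not output from a real run:

```shell
# Create an illustrative log (stand-in for the real condor log file).
cat > example.log <<'EOF'
000 (1234.000.000) 01/01 10:00:00 Job submitted from host: <10.0.0.1:9618>
001 (1234.000.000) 01/01 10:00:05 Job executing on host: <10.0.0.2:9618>
012 (1234.000.000) 01/01 10:00:06 Job was held.
	Failed to execute '/home/user/run_job.sh' (errno=13: 'Permission denied')
EOF

# Pull out any hold events plus the reason line that follows each one.
grep -A1 '^012' example.log
```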

If you've checked these but still can't figure out where the problem is, please get in touch with MWS.