I'm using stampede.TACC for jobs that need significantly longer than 48 hours to run. Luckily, John Fonner at the Texas Advanced Computing Center has been kind enough to prepare a SLURM script that circumvents this limit by daisy-chaining jobs.
Note that you should under no circumstances do this unless you've been specifically allowed to do so by your cluster manager.
If you get clearance, submit the script: it will run in the background and resubmit jobs until the work is done.
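In broad strokes the mechanism works like this (a simplified sketch of the idea only, not the contents of the actual daisychain.slurm; thisJobNumber and loginNode mirror variable names the real script uses, while maxJobNumber and the echoed resubmit command are illustrative assumptions):

```shell
#!/bin/bash
# Simplified sketch of the daisy-chain idea -- NOT the actual
# daisychain.slurm. thisJobNumber and loginNode mirror variable names
# that appear in the real script; maxJobNumber is hypothetical.
thisJobNumber=${thisJobNumber:-1}   # 1 on first submission, bumped on resubmits
maxJobNumber=5                      # stop the chain after five links
loginNode="localhost"

# ... run (part of) the actual computation here ...

# If more links are needed, resubmit this script from the login node
# (compute nodes frequently can't reach the scheduler directly,
# hence the ssh hop). Shown as an echo here rather than executed.
if [ "$thisJobNumber" -lt "$maxJobNumber" ]; then
    next=$((thisJobNumber + 1))
    echo "would run: ssh $loginNode 'sbatch --export=thisJobNumber=$next edited.slurm'"
fi
```

Each link therefore does a 48-hour-sized chunk of work and hands off to the next, which is why the script needs to know the login node's name.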
To get the daisychain script, do
mkdir ~/tmp
cd ~/tmp
git clone https://github.com/johnfonner/daisychain.git
This will pull the latest version of daisychain.slurm. Rename it to e.g. edited.slurm.
General editing of the slurm script:
1.
Replace all instances of
~/.daisychain
with
~/daisychain_$baseSlurmJobName
to avoid conflicts when several jobs are running concurrently
2.
To run the script on your own system, which you've set up as shown in this post, change
loginNode="login1"
to
loginNode="localhost"
If you're using stampede.TACC, stick to login1.
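Both edits above can be done with sed. The demo below runs on a two-line stand-in file; in practice, run the same two sed commands on your renamed copy (called edited.slurm here, as suggested earlier), and skip the second command on stampede.TACC:

```shell
# Two-line stand-in for the real script, just for demonstration.
printf '%s\n' '~/.daisychain' 'loginNode="login1"' > edited.slurm

# Step 1: give each chain its own state directory so concurrent jobs
# don't clash. | is the sed delimiter; \. matches the literal dot.
sed -i 's|~/\.daisychain|~/daisychain_$baseSlurmJobName|g' edited.slurm

# Step 2 (home systems only -- keep login1 on stampede.TACC):
sed -i 's|loginNode="login1"|loginNode="localhost"|' edited.slurm

cat edited.slurm
```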
3. For Gaussian jobs on stampede.TACC
A.
put
module load gaussian
before
if [ "$thisJobNumber" -eq "1" ]; then
B.
Set up your restart job scripts. For example, if the job section of your slurm script looks like this
mkdir $SCRATCH/gaussian_tmp
export GAUSS_SCRDIR=$SCRATCH/gaussian_tmp
if [ "$thisJobNumber" -eq "1" ]; then
#first job
echo "Starting First Job:"
g09 < freq.g09in > output_$thisJobNumber
else
#continuation
echo "Starting Continuation Job:"
g09 < freq_restart.g09in > output_$thisJobNumber
fi
with freq.g09in being something along the lines of
%nprocshared=16
%rwf=/scratch/0XXXX/XXXX/gaussian_tmp/ajob.rwf
%Mem=2000000000
%Chk=/home1/0XXX/XXXX/myjob/ajob.chk
#P rpbe1pbe/GEN 5D Freq() SCF=(MaxCycle=256 ) Punch=(MO) Pop=()
Note that the above example is a bit special, since it 1) saves the .rwf file (which is huge) and 2) restarts a frequency job. For a simple geometry optimisation it's enough to restart from the .chk file. freq_restart.g09in looks like this:
%nprocshared=16
%Mem=2000000000
%rwf=/scratch/0XXX/XXXX/gaussian_tmp/ajob.rwf
%Chk=/home1/0XXXX/XXXX/myjob/ajob.chk
#P restart
Testing at home
I set up a home system with slurm as shown here: http://verahill.blogspot.com.au/2014/03/565-setting-up-slurm-on-debian-wheezy.html
First edit the daisychain.slurm script as shown above. Note that your slurm script must end with .slurm for the script to recognise it as a slurm script. You can get around this by editing your script and specifying a job script name.
Specifically, change the run time to
#SBATCH -t 00:00:10 # Run time (hh:mm:ss)
comment out the partition name
##SBATCH -p normal
#-------------------Job Goes Here--------------------------
if [ "$thisJobNumber" -eq "1" ]; then
echo "Starting First Job:"
sh sleeptest.sh
else
echo "Starting Continuation Job:"
sh sleeptest_2.sh
fi
#----------------------------------------------------------
Next, set up key-based login for localhost (if you haven't got a keypair, use ssh-keygen first):
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
exit
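The key generation itself can be done along these lines (one possible ssh-keygen invocation; the empty passphrase via -N '' is what makes the login non-interactive, and the guard avoids clobbering an existing key). This has to happen before the authorized_keys step above:

```shell
# Generate an RSA keypair with an empty passphrase, but only if none
# exists yet (ssh-keygen would otherwise prompt before overwriting).
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N '' -f "$HOME/.ssh/id_rsa"
```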
Create two job files. sleeptest.sh:
echo "first job"
date
sleep 65
date
and sleeptest_2.sh:
echo "second job"
date
sleep 9
echo "Do nothing"
Submit using
sbatch test.slurm
Make sure to change
#SBATCH -J testx # Job name
for each job so that you can have several running concurrently.