I've had g09 frequency jobs die on me, and in g09 analytical frequency jobs can only be restarted from the .rwf file. Because the .rwf files are 160 GB, I don't want to be copying them back and forth between nodes. It's easier, then, to simply make sure that the restarted job runs on the same node as the original job.
A good resource for SGE related stuff is http://rous.mit.edu/index.php/SGE_Instructions_and_Tips#Submitting_jobs_to_specific_queues
Either way, first figure out what node the job ran on. Assuming that the job number was 445:
qacct -j 445|grep hostname
hostname compute-0-6.local
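If the job is still running (say it's about to hit its walltime limit), qacct won't have a record for it yet; plain qstat should do the trick instead -- a minimal sketch, assuming a standard SGE setup, where the queue column shows the queue instance, e.g. all.q@compute-0-6.local:
qstat -u $USER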
Next figure out the PID, as this is used to name the Gau-[PID].rwf file:
grep PID g03.g03out
Entering Link 1 = /share/apps/gaussian/g09/l1.exe PID= 24286.
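If you'd rather not copy the number out by hand, something like the following should pull it into a shell variable (a sketch, assuming the output file only contains one PID= line) and check that the rwf file is actually still sitting in scratch on that node, using the node name and path from above:
pid=$(grep -m1 'PID=' g03.g03out | awk '{print $NF}' | tr -d '.')
ssh compute-0-6.local ls -lh /scratch/Gau-${pid}.rwf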
You can now craft your restart file, g09_freq.restart -- you'll need to make sure that the paths are appropriate for your system:
%nprocshared=8
%Mem=900000000
%rwf=/scratch/Gau-24286.rwf
%Chk=/home/me/jobs/testing/delta_631gplusstar-freq/delta_631gplusstar-freq.chk
#P restart

(having empty lines at the end of the file is important) and a qsub file, g09_freq.qsub:
#$ -S /bin/sh
#$ -cwd
#$ -l h_rt=999:30:00
#$ -l h_vmem=8G
#$ -j y
#$ -pe orte 8
export GAUSS_SCRDIR=/tmp
export GAUSS_EXEDIR=/share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
/share/apps/gaussian/g09/g09 g09_freq.restart > g09_freq.out
Then submit it to the correct queue by doing

qsub -q all.q@compute-0-6.local g09_freq.qsub
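If you'd rather not hard-code the queue instance, SGE can also pin a job to a host via the hostname resource -- a hedged alternative, assuming the default hostname complex is available on your cluster:
qsub -l hostname=compute-0-6.local g09_freq.qsub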
The output goes to g09_freq.out. You know the restart worked properly if it says

Skip MakeAB in pass 1 during restart.

and

Resume CPHF with iteration 214.
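A quick way of checking for those two lines once the job is underway (file name as in the qsub script above):
grep -E 'Skip MakeAB|Resume CPHF' g09_freq.out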
Note that restarting analytical frequency jobs in g09 can be a hit-and-miss affair. Jobs that run out of time are easy to restart, and some jobs that died silently have also been restarted successfully. On the other hand, a job that died because my resource allocations ran out couldn't be restarted, i.e. the restart started the freq job from scratch. The same happened with a node of mine that has what seems like a dodgy PSU. Finally, I also couldn't restart jobs that died silently due to all the RAM being allocated to g09 without leaving any for the OS (or at least that's the current best theory). It may thus be a good idea to back up the rwf file every now and again, in spite of its unwieldy size.
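A minimal sketch of such a backup, run from the head node, using the node and rwf path from above; /home/me/rwf_backup is a hypothetical destination directory:
rsync -av compute-0-6.local:/scratch/Gau-24286.rwf /home/me/rwf_backup/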