Recovering from failed jobs


Simulations sometimes fail, for example by running out of allocated resources or because a compute node fails. You can recover from such failures by restarting from an output dump, if one is available.

Restarting from an intermediate dump works the same way as a continuation run (CRUN). Add the hand-edit file ~access/crun.ed to the job if it is not already present, then resubmit the job. The run restarts from the latest dump (*.da) file. There is no need to alter the run start time or run duration; the run simply continues from where it left off. You also do not need to enable automatic resubmission, since restarting works without it.

[Figure: crun.png]
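
Before resubmitting, it can help to confirm which dump the run is likely to pick up. The following is a minimal sketch, not part of the model tooling: the helper name latest_dump is hypothetical, the glob pattern "*.da*" is an assumption about how dump files are named in your run directory, and file modification time is used as a stand-in for "latest", which may not match the model's own choice in every case.

```python
from pathlib import Path


def latest_dump(run_dir, pattern="*.da*"):
    """Return the most recently modified dump file in run_dir, or None.

    Assumption: dump files match `pattern` and the newest file on disk
    is the latest dump. Adjust the pattern to your own file naming.
    """
    dumps = sorted(Path(run_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return dumps[-1] if dumps else None
```

Calling latest_dump with the path to the model run directory returns a Path object for the newest matching file, or None if no dump is present, in which case there is nothing to restart from.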

Continuing from a different dump


To continue the run from a dump that isn't the most recent, edit the file $RUNID.phist (found in the model run directory) and change the value of ARESTART to point to the dump you'd like to restart from. Once this is done, submit a CRUN following the instructions above.
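
The edit to ARESTART can be made in any text editor. If you prefer to script it, the following is a rough sketch under stated assumptions: it assumes ARESTART appears in the .phist file as a quoted string assignment (for example ARESTART = '...'), which you should check against your own file first. The function name set_arestart is hypothetical, and it is worth keeping a backup copy of the .phist file before changing it.

```python
import re


def set_arestart(phist_text, dump_path):
    """Return phist_text with the first ARESTART value replaced by dump_path.

    Assumption: the entry looks like ARESTART = 'path' or ARESTART="path".
    Raises ValueError if no such entry is found, so nothing is changed silently.
    """
    pattern = re.compile(r"(ARESTART\s*=\s*)(['\"]).*?\2")

    def replace(match):
        # Keep the original spacing and quote style; swap only the value.
        return f"{match.group(1)}{match.group(2)}{dump_path}{match.group(2)}"

    new_text, count = pattern.subn(replace, phist_text, count=1)
    if count == 0:
        raise ValueError("no ARESTART entry found")
    return new_text
```

To apply it, read the contents of $RUNID.phist into a string, pass it to set_arestart together with the path of the chosen dump, and write the result back to the file.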