ACCESS-S


This is an as-yet unlisted document charting my progress with ACCESS-S.
Once it's running, this should hopefully make it easier to turn these notes into proper documentation.

Getting ACCESS-S


This is the initial email I got from Hailin:
Hi Holger,
 
If you'd like to play with the ACCESS-S1 suite, you can copy my suite au-aa563.
 
The following are the key parameters to run the suite.
 
In rose-suite.conf:
MAKE_BUILDS=true               # set true to compile the source code
N_GSHC_MEMBERS=3               # number of ensemble members, same as MEMBERS=-m in app/glosea_init_cntl_file/rose-app.conf
N_GSHC_STEPS=2                 # number of RESUBMITs (number of chunk runs)
RESUB_DAYS=1                   # number of days per chunk run
 
In app/glosea_init_cntl_file/rose-app.conf:
GS_HCST_START_DATE=1990050100  # start date, 01 May in this case
MEMBERS=-m 3                   # total number of ensembles, must be the same as N_GSHC_MEMBERS in rose-suite.conf
GS_YEAR_LIST=1997              # the year of the run
 
After you have compiled the code and run the job successfully, you can maintain your own INSTALL_DIR, which is defined in suite.rc:
INSTALL_DIR = "/short/dx2/hxy599/gc2-install"
 
If you have any problems please let me know.
 
 
Regards,
Hailin

So I made a copy of that; the new suite is au-aa566. Most settings were already at the values Hailin indicated in his email.
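
Copying a suite is a one-liner with rosie; a sketch, assuming the standard rosie setup on raijin:

$ rosie copy au-aa563     # creates a new suite ID (au-aa566 here) with the same content, checked out under ~/roses
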
I've changed INSTALL_DIR to /short/${PROJECT}/${USER}/gc2-install. I'm not a member of the dx2 or ub7 groups either, so I'm also trying to copy the DUMP_DIR and DUMP_DIR_BOM directories to /short/${PROJECT}/${USER}/dump and /short/${PROJECT}/${USER}/dump-bom respectively, but at 27 TB of data that just isn't feasible.
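
A few quick checks along these lines help establish what needs changing and whether the source directories are readable at all (the grep pattern is just my guess at where the variables are defined):

$ grep -n 'INSTALL_DIR\|DUMP_DIR' ~/roses/au-aa566/suite.rc   # where are these set?
$ du -sh /short/dx2/hxy599/gc2-install                        # how much would need copying?
$ ls -ld /short/dx2 /short/ub7                                # readable without group membership?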


Getting ACCESS-S to run


I've copied the suite and simply tried to run it, but it failed with a cascade of error messages, culminating in "Illegal item: [scheduling]initial cycle time".

The solution to this is to run older versions of cylc and rose, selected with this command:

$ CYLC_VERSION=6.9.1 ROSE_VERSION=2016.06.1 rosie go
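
To avoid retyping this every time, the version pins can go into the shell profile; the inline assignments above work because the site's rose/cylc wrapper scripts select the version from these environment variables:

# in ~/.bashrc: pin the versions this suite was written for
export CYLC_VERSION=6.9.1
export ROSE_VERSION=2016.06.1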

First hurdles:


  1. gsfc_get_analysis ends up in the submit-failed state
  2. GSHC_M1 through GSHC_M3 fail

For now, I've reset the suite.rc to point to the BoM directories, to see whether that changes anything -- it didn't.

Looking at the job activity log and the job script itself for gsfc_get_analysis, I notice directives that PBS doesn't understand: ConsumableMemory(2GB) and wall_clock_limit (these look like IBM LoadLeveler directives, presumably left over from a Met Office machine). I find these strings in suite.rc and replace them with -l vmem=2GB and -l walltime=01:11:00. (I also find another reference to these values for glosea_joi_prods, and change them as well.)
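
For the record, cylc 6 keeps the generated job script and its logs under the suite's cylc-run directory, which is where the bad directives turn up (the cycle point and submit number here are assumptions based on the start date above):

$ ls ~/cylc-run/au-aa566/log/job/1990050100/gsfc_get_analysis/01/
$ grep -n 'PBS\|ConsumableMemory\|wall_clock' ~/cylc-run/au-aa566/log/job/1990050100/gsfc_get_analysis/01/job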

This seems to have worked for gsfc_get_analysis, but the GSHC_M1-3 tasks still fail. In their output I found this error message:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR        ???!!!???!!!???!!!???!!!???!!!???!!!
?  Error   Code:    19
?  Error   Message:  Error reading namelist NLSTCALL. Please check input list against code.
?  Error   from processor:     0
?  Error   number:     0
????????????????????????????????????????????????????????????????????????????????

It seems the namelist contains a value for control_resubmit, which this version of the UM doesn't understand. Since rose considers this variable compulsory (so it can't simply be removed through the config editor), I've had to remove it directly from the file ~/roses/au-aa566/app/coupled/rose-app.conf, and now I've submitted it again. (Or I could have disabled all metadata from the menu option...)
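
An alternative to deleting the line outright: rose's config syntax lets you user-ignore a setting by prefixing its name with "!", which keeps it in the file but drops it from the generated namelist. A sketch (the section name and value are assumptions, not copied from the suite; rose may still warn because the metadata marks the setting compulsory, which is where disabling the metadata comes in):

# in app/coupled/rose-app.conf: the leading "!" marks the setting
# user-ignored, so it is omitted from the namelist rose writes out
[namelist:nlstcall]
!control_resubmit='y'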

Second issue:


gsfc_get_analysis fails at the end of its run, but judging by the resource usage it isn't actually doing much:

======================================================================================
                  Resource Usage on 2017-06-06 15:39:23:
   Job Id:             5497709.r-man2
   Project:            w35
   Exit Status:        1
   Service Units:      1.24
   NCPUs Requested:    1                      NCPUs Used: 1
                                           CPU Time Used: 00:00:03
   Memory Requested:   500.0MB               Memory Used: 9.56MB
   Walltime requested: 01:11:00            Walltime Used: 01:14:15
   JobFS requested:    100.0MB                JobFS used: 0B
======================================================================================

CPU time used is only 3 seconds, yet the job ran until it exceeded its walltime after almost 1h15m -- so it was presumably sitting idle, waiting for input that never arrived.

So it seems that, since SUITE_TYPE is set to research (and thereby GS_SUITE_TYPE is also research), some environment variables are set to directories that might exist on the Met Office computer, but not on raijin:

{%- if RUN_GSFC or RUN_GSMN %}
    [[gsfc_get_analysis]]
        environment scripting = """eval $(rose task-env)
                                   export SHORT_DATE=${ROSE_TASK_CYCLE_TIME%%00}"""
        [[[environment]]]
            ANALYSES_DATADIR = ${ROSE_DATAC}/analyses/${ROSE_TASK_CYCLE_TIME}
            ROSE_TASK_APP    = glosea_get_fcst_analyses
            {% if GS_SUITE_TYPE != 'research' %}
              FOAM_SUITE_NAME   = $(os_get_suiteid --mode={{ SUITE_TYPE }} ocean)
              GLOBAL_SUITE_NAME = $(os_get_suiteid --mode={{ SUITE_TYPE }} global)
              ROSE_DATAC_GLOBAL = ${ROSE_DATAC/$CYLC_SUITE_NAME/$GLOBAL_SUITE_NAME}
              ROSE_DATAC_FOAM   = ${ROSE_DATAC/$CYLC_SUITE_NAME/$FOAM_SUITE_NAME}
            {% else %}
              ROSE_DATAC_GLOBAL = /critical/opfc/suites-oper/global/share/data/${ROSE_TASK_CYCLE_TIME}
              ROSE_DATAC_FOAM   = /critical/opfc/suites-oper/ocean/share/data/${ROSE_TASK_CYCLE_TIME}
            {% endif %}
        [[[directives]]]
            -l = "vmem=2GB,walltime=01:11:00"
#            resources        = ConsumableMemory(2Gb)
#            wall_clock_limit = "01:11:00,01:10:00"

For now I've replaced the else clause above with the corresponding paths from the original suite, and will try again.
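
Roughly, the replacement looks like the sketch below; the exact directories are whatever the original suite used, so these paths are placeholders named after the DUMP_DIR variables mentioned earlier:

            {% else %}
              {# placeholder paths: substitute the directories the original suite points at #}
              ROSE_DATAC_GLOBAL = ${DUMP_DIR}/${ROSE_TASK_CYCLE_TIME}
              ROSE_DATAC_FOAM   = ${DUMP_DIR_BOM}/${ROSE_TASK_CYCLE_TIME}
            {% endif %}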

Full Reset


Scott noticed that there were some unexpected changes in the configuration file, namely that RUN_GSFC and RUN_GSMN were set to true.

Since I couldn't remember ever changing them, I just did a full reset and changed only the project.
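
Since rosie suites are subversion working copies, unintended local changes like these can also be spotted and reverted with plain svn, without re-copying anything (one way to do the reset, assuming there are no local changes worth keeping):

$ cd ~/roses/au-aa566
$ svn status                  # list locally modified files
$ svn diff rose-suite.conf    # shows e.g. RUN_GSFC/RUN_GSMN flipped to true
$ svn revert -R .             # discard all local changes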