Debugging job starts

Debugging jobs. User calls in: job 927 is not starting.

Check show_q. Output should show the job and estimated start time (RES_START). If show_q shows that the RES_START has a negative start time, catalina is trying to start the job. Verify that Last schedule update is recent (within ~5 minutes of date command).

catalina@popocatzin:~/catalina/install$ ./show_q
SERVERMODE (NORMAL)
No TZ found
Last schedule update (12/07/2010 17:51:23)
ACTIVE JOBS--------------------
 JOBNAME              USER       STATE    NODES  REMAINING STARTTIME   LIMIT     CLASS        RISK
                  158    kenneth  Running    2    00:07:06 12/07 17:48  00:10:00  trabajoslyn     0
Allocated Nodes: (2)
Allocated Cpus: (16)
Allocated Memory: (2)
IDLE JOBS--------------------
 JOBNAME              USER       STATE    NODES LIMIT     QUEUETIME   CLASS        RISK RES_START  RES_BIND
                  159    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0  -00:01:37[]
                  160    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0   00:12:04[]
                  161    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0   00:13:22[]
                  162    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0   00:27:04[]
                  163    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0   00:28:22[]
                  164    kenneth     Idle    2   00:10:00 12/07 17:48  trabajoslyn    0   00:42:04[]
NON-QUEUED JOBS--------------------
 JOBNAME              USER       STATE    NODES LIMIT     CLASS        RISK REASON
                  157    kenneth     Idle    2   00:10:00  trabajoslyn    0 QOSMAXJOBQUEUEDPERUSERPOLICY
                  165    kenneth     Idle    2   00:10:00  trabajoslyn    0 QOSMAXJOBQUEUEDPERUSERPOLICY
                  166    kenneth     Idle    2   00:10:00  trabajoslyn    0 QOSMAXJOBQUEUEDPERUSERPOLICY
HOTPAGE: 16 40
catalina@popocatzin:~/catalina/install$ date
Tue Dec  7 17:51:27 CST 2010
catalina@popocatzin:~/catalina/install$
    

Last schedule update was 12/07/2010 17:51:23. This indicates the latest time catalina finished a complete scheduling run. The current time from the date command was Tue Dec 7 17:51:27 CST 2010, so we are current.

Check show_events to verify job start attempt: rc from scontrol update should be '0'.


catalina@popocatzin:~/catalina/install$ ./show_events | more
SERVERMODE (NORMAL)
--------------------------------------------------------------------------------
Tue Dec  7 17:49:47 2010
cmd : /home/catalina/catalina/install/rj_SLURM  159  popocatzin6\
 popocatzin3\

name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(159) (['popocatzin6', 'popocatzin3']
)\n", 'rc from scontrol update is >0<\n']

SERVERMODE (NORMAL)
--------------------------------------------------------------------------------
Tue Dec  7 17:20:15 2010
cmd : /home/catalina/catalina/install/rj_SLURM  156  popocatzin4\
 popocatzin3\

name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(156) (['popocatzin4', 'popocatzin3']
)\n", 'rc from scontrol update is >0<\n']
--------------------------------------------------------------------------------
Tue Dec  7 17:19:45 2010
cmd : /home/catalina/catalina/install/rj_SLURM  155  popocatzin5\
 popocatzin6\

name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(155) (['popocatzin5', 'popocatzin6']
)\n", 'rc from scontrol update is >0<\n']
--------------------------------------------------------------------------------
    

Check reservation for 159:

catalina@popocatzin:~/catalina/install$ ./show_res --full --node_list | grep 159
1291765711        -0.09         0.16         0.25     2    16      2                  159                  job  catalina  ( popocatzin6 popocatzin3)
catalina@popocatzin:~/catalina/install$
    

First column is reservation id '1291765711'.

catalina@popocatzin:~/catalina/install$ ./show_res --full --node_list | grep 159
1291765711        -0.09         0.16         0.25     2    16      2                  159                  job  catalina  ( popocatzin6 popocatzin3)
catalina@popocatzin:~/catalina/install$ ./show_res --res=1291765711 --start --end --readable --node_list
SERVERMODE (NORMAL)
No TZ found
ResID       StartTime                 EndTime                   NodeList
1291765711  Tue Dec  7 17:49:47 2010  Tue Dec  7 18:04:47 2010   popocatzin6 popocatzin3
catalina@popocatzin:~/catalina/install$ date
Tue Dec  7 17:57:35 CST 2010
catalina@popocatzin:~/catalina/install$
    

The reservation is active now, having started at Tue Dec 7 17:57:35 2010 on nodes popocatzin6, popocatzin3.

Check node status:


catalina@popocatzin:~$ scontrol --oneliner show node | grep -e popocatzin3 -e popocatzin6
NodeName=popocatzin3 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) OS=Linux RealMemory=1 Sockets=8 State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Reason=(null)
NodeName=popocatzin6 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) OS=Linux RealMemory=1 Sockets=8 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Reason=(null)
catalina@popocatzin:~$

    

So, popocatzin3 is down. Fix popocatzin3.

Catalina has a RESOURCE_DOWNTIME_LIMIT that considers a node up, for 5 min before deciding it's down. After about 5 min, the node should be recognized as down, and the job will be rescheduled on other nodes, possibly delayed. In catalina.config, this can be configured: RESOURCE_DOWN_TIME_LIMIT=900.0

It's also possible to run catalina in foreground to see error output:

catalina@popocatzin:~/catalina/install$ ./catalina_schedule_jobs
SERVERMODE (NORMAL)
iteration (0)
Catalina.Now_float (Tue Dec  7 18:05:04 2010)
0.02
Waiting for configuration db lock
Received configuration db lock
temp_dict[my_hostname] ([1239, 12745])
Released configuration db write lock
Waiting for configured_resources db lock
Received configured_resources db lock
Waiting for events db lock
Received events db lock
Waiting for jobs db lock
Received jobs db lock
Waiting for old_jobs db lock
Received old_jobs db lock
Waiting for old_reservations db lock
Received old_reservations db lock
Waiting for reservations db lock
Received reservations db lock
Waiting for resource db lock
Received resource db lock
Waiting for standing_reservations db lock
Received standing_reservations db lock
moving old jobs
moving old reservations
rolling logs
refreshing node dictionary after 1 intervals
update_resources and filter out bad nodes from job list
RESOURCE_DOWN_TIME_LIMIT exceeded for (popocatzin3)!
update_job_info
cancel_overrun_jobs
cancel_bad_jobs
update_job_resource_lists
update_resource_usability
update_job_speculative_system_queue_time
update_job_priorities
get_eligible_and_running_jobs
update_running_reservations
updating standing reservations
migrating shortpools
create_job_reservations
job reservation (1291765711) failed for runID (159) due to insufficient resources
unspecified exception () () ()
  File "/home/catalina/catalina/install/Catalina.py", line 4649, in create_job_reservations
    raise ResIsNone()
job reservation (1291765707) failed for runID (157) due to insufficient resources
unspecified exception () () ()
  File "/home/catalina/catalina/install/Catalina.py", line 4649, in create_job_reservations
    raise ResIsNone()
cancel run-at-risk jobs
run_jobs
in run_jobs
len(jobs_to_run.keys()) is (0)
elapsed (0.905797958374)
Waiting for configuration db lock
Received configuration db lock
temp_pidslist ([1239])
Released configuration db write lock
Released standing_reservations db read lock
Released resource db write lock
Released reservations db write lock
Released old_reservations db write lock
Released old_jobs db write lock
Released jobs db write lock
Released events db write lock
Released configured_resources db write lock
catalina@popocatzin:~/catalina/install$