Debugging jobs. User calls in: job 927 is not starting.
Check show_q. Output should show the job and estimated start time (RES_START). If show_q shows that the RES_START has a negative start time, catalina is trying to start the job. Verify that Last schedule update is recent (within ~5 minutes of date command).
catalina@popocatzin:~/catalina/install$ ./show_q
SERVERMODE (NORMAL)
No TZ found
Last schedule update (12/07/2010 17:51:23)
ACTIVE JOBS--------------------
JOBNAME USER STATE NODES REMAINING STARTTIME LIMIT CLASS RISK
158 kenneth Running 2 00:07:06 12/07 17:48 00:10:00 trabajoslyn 0
Allocated Nodes: (2)
Allocated Cpus: (16)
Allocated Memory: (2)
IDLE JOBS--------------------
JOBNAME USER STATE NODES LIMIT QUEUETIME CLASS RISK RES_START RES_BIND
159 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 -00:01:37[]
160 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 00:12:04[]
161 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 00:13:22[]
162 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 00:27:04[]
163 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 00:28:22[]
164 kenneth Idle 2 00:10:00 12/07 17:48 trabajoslyn 0 00:42:04[]
NON-QUEUED JOBS--------------------
JOBNAME USER STATE NODES LIMIT CLASS RISK REASON
157 kenneth Idle 2 00:10:00 trabajoslyn 0 QOSMAXJOBQUEUEDPERUSERPOLICY
165 kenneth Idle 2 00:10:00 trabajoslyn 0 QOSMAXJOBQUEUEDPERUSERPOLICY
166 kenneth Idle 2 00:10:00 trabajoslyn 0 QOSMAXJOBQUEUEDPERUSERPOLICY
HOTPAGE: 16 40
catalina@popocatzin:~/catalina/install$ date
Tue Dec 7 17:51:27 CST 2010
catalina@popocatzin:~/catalina/install$
Last schedule update was 12/07/2010 17:51:23. This indicates the latest time catalina finished a complete scheduling run. The current time from the date command was Tue Dec 7 17:51:27 CST 2010, so we are current.
Check show_events to verify job start attempt: rc from scontrol update should be '0'.
catalina@popocatzin:~/catalina/install$ ./show_events | more
SERVERMODE (NORMAL)
--------------------------------------------------------------------------------
Tue Dec 7 17:49:47 2010
cmd : /home/catalina/catalina/install/rj_SLURM 159 popocatzin6\
popocatzin3\
name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(159) (['popocatzin6', 'popocatzin3']
)\n", 'rc from scontrol update is >0<\n']
SERVERMODE (NORMAL)
--------------------------------------------------------------------------------
Tue Dec 7 17:20:15 2010
cmd : /home/catalina/catalina/install/rj_SLURM 156 popocatzin4\
popocatzin3\
name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(156) (['popocatzin4', 'popocatzin3']
)\n", 'rc from scontrol update is >0<\n']
--------------------------------------------------------------------------------
Tue Dec 7 17:19:45 2010
cmd : /home/catalina/catalina/install/rj_SLURM 155 popocatzin5\
popocatzin6\
name : run_jobs
return_string : ['SERVERMODE (NORMAL)\n', "(155) (['popocatzin5', 'popocatzin6']
)\n", 'rc from scontrol update is >0<\n']
--------------------------------------------------------------------------------
Check reservation for 159:
catalina@popocatzin:~/catalina/install$ ./show_res --full --node_list | grep 159
1291765711 -0.09 0.16 0.25 2 16 2 159 job catalina ( popocatzin6 popocatzin3)
catalina@popocatzin:~/catalina/install$
First column is reservation id '1291765711'.
catalina@popocatzin:~/catalina/install$ ./show_res --full --node_list | grep 159
1291765711 -0.09 0.16 0.25 2 16 2 159 job catalina ( popocatzin6 popocatzin3)
catalina@popocatzin:~/catalina/install$ ./show_res --res=1291765711 --start --end --readable --node_list
SERVERMODE (NORMAL)
No TZ found
ResID StartTime EndTime NodeList
1291765711 Tue Dec 7 17:49:47 2010 Tue Dec 7 18:04:47 2010 popocatzin6 popocatzin3
catalina@popocatzin:~/catalina/install$ date
Tue Dec 7 17:57:35 CST 2010
catalina@popocatzin:~/catalina/install$
The reservation is active now, having started at Tue Dec 7 17:57:35 2010 on nodes popocatzin6, popocatzin3.
Check node status:
catalina@popocatzin:~$ scontrol --oneliner show node | grep -e popocatzin3 -e popocatzin6
NodeName=popocatzin3 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) OS=Linux RealMemory=1 Sockets=8 State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Reason=(null)
NodeName=popocatzin6 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUErr=0 CPUTot=8 Features=(null) OS=Linux RealMemory=1 Sockets=8 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Reason=(null)
catalina@popocatzin:~$
So, popocatzin3 is down. Fix popocatzin3.
Catalina has a RESOURCE_DOWNTIME_LIMIT that considers a node up, for 5 min before deciding it's down. After about 5 min, the node should be recognized as down, and the job will be rescheduled on other nodes, possibly delayed. In catalina.config, this can be configured: RESOURCE_DOWN_TIME_LIMIT=900.0
It's also possible to run catalina in foreground to see error output:
catalina@popocatzin:~/catalina/install$ ./catalina_schedule_jobs SERVERMODE (NORMAL) iteration (0) Catalina.Now_float (Tue Dec 7 18:05:04 2010) 0.02 Waiting for configuration db lock Received configuration db lock temp_dict[my_hostname] ([1239, 12745]) Released configuration db write lock Waiting for configured_resources db lock Received configured_resources db lock Waiting for events db lock Received events db lock Waiting for jobs db lock Received jobs db lock Waiting for old_jobs db lock Received old_jobs db lock Waiting for old_reservations db lock Received old_reservations db lock Waiting for reservations db lock Received reservations db lock Waiting for resource db lock Received resource db lock Waiting for standing_reservations db lock Received standing_reservations db lock moving old jobs moving old reservations rolling logs refreshing node dictionary after 1 intervals update_resources and filter out bad nodes from job list RESOURCE_DOWN_TIME_LIMIT exceeded for (popocatzin3)! update_job_info cancel_overrun_jobs cancel_bad_jobs update_job_resource_lists update_resource_usability update_job_speculative_system_queue_time update_job_priorities get_eligible_and_running_jobs update_running_reservations updating standing reservations migrating shortpools create_job_reservations job reservation (1291765711) failed for runID (159) due to insufficient resources unspecified exception () () ( ) File "/home/catalina/catalina/install/Catalina.py", line 4649, in create_job_reservations raise ResIsNone() job reservation (1291765707) failed for runID (157) due to insufficient resources unspecified exception ( ) () ( ) File "/home/catalina/catalina/install/Catalina.py", line 4649, in create_job_reservations raise ResIsNone() cancel run-at-risk jobs run_jobs in run_jobs len(jobs_to_run.keys()) is (0) elapsed (0.905797958374) Waiting for configuration db lock Received configuration db lock temp_pidslist ([1239]) Released configuration db write lock Released standing_reservations db read lock Released resource db write lock Released reservations db write lock Released old_reservations db write lock Released old_jobs db write lock Released jobs db write lock Released events db write lock Released configured_resources db write lock catalina@popocatzin:~/catalina/install$