- SLURM hold jobs
- Submit a job
- Run a job
- External Scheduler
- Catalina Install
- Catalina Commands
We want to disable the SLURM internal scheduler. We do this by configuring the Wiki interface of SLURM.
- edit /etc/slurm-llnl/slurm.conf
- edit /etc/slurm-llnl/wiki.conf
- restart slurmctld
In /etc/slurm-llnl/slurm.conf:
...
SchedulerType=sched/wiki
SchedulerPort=7321
...
In /etc/slurm-llnl/wiki.conf:
AuthKey=<key>
With the Wiki interface active, submitted jobs go into JobHeld (Priority=0). Now, we can use SLURM commands to start jobs.
#!/bin/sh
#SBATCH --time=10
#SBATCH --nodes=2
#SBATCH --tasks-per-node=8
/bin/hostname
/usr/local/bin/srun -l /bin/hostname
/usr/local/bin/srun -l /bin/pwd
/home/kenneth/info/openmpi/install/bin/mpirun \
    /home/kenneth/testprog/ring26 -t 10 -n 3 -l 100 \
    -i 0.03125 -c 0 -s 0
/bin/sleep 900
/usr/local/bin/sbatch -o out_%j_%t r.script
We can use the scontrol command to:
- specify which nodes are used by the job
- specify a non-zero Priority to tell SLURM to run the job
/usr/local/bin/scontrol update JobId=<job id> \
    ReqNodeList=<nodelist> Priority=1
Here's output from issuing the scontrol update command to a held job:
http://users.sdsc.edu/~kenneth/ipn.2010/workshop/simple1.txt
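The manual release step above is small enough to show in a few lines of Python. This is a minimal sketch, not part of Catalina: the function name build_scontrol_update, the job id 1234, and the node names in the example are illustrative assumptions. Only the scontrol path and the JobId, ReqNodeList and Priority fields come from the command shown above.

```python
import shlex


def build_scontrol_update(job_id, node_list, priority=1,
                          scontrol="/usr/local/bin/scontrol"):
    """Return the argv list that releases a held job onto specific nodes.

    Hypothetical helper: it only assembles the command shown in the text.
    """
    if priority <= 0:
        # Priority=0 is the held state, so a release needs a positive value.
        raise ValueError("priority must be greater than zero")
    return [
        scontrol, "update",
        f"JobId={job_id}",
        f"ReqNodeList={','.join(node_list)}",
        f"Priority={priority}",
    ]


# Example values are made up for illustration.
argv = build_scontrol_update(1234, ["popocatzin2", "popocatzin3"])
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would issue the command on a host where scontrol is installed at that path.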
Instead of doing this by hand, we can write a script to do it for us. In this case, a Python script:
Download the latest tar ball from http://www.sdsc.edu/catalina
Unpack the software into a build directory:
tar zxf catalina.tar.gz
cd catalina
./conf.sh
make
make install
make config
Copy the catalina.config file to the install directory
Go to the Catalina home directory:
cd <installdir>
./initialize_dbs
Start the scheduler:
./start.ksh
Alternatively, run the scheduler in the foreground:
./set_config --key_value=state:running
./catalina_schedule_jobs --iterate
This prints to STDOUT and stays in the foreground.
Display the queue:
./show_q | more
Show reservations:
./show_res | more
./set_res --user_list=testdemo,diegella \
    --start=00:00_12/25/2010 --end=01:00_01/25/2011 \
    --resource_amount=3 --mode=real
--feature_list=CPUs8,MEM32 can be used to restrict nodes
Note the res_id given for the reservation:
... reservation 996775412 created on 32 nodes with start_time 1001376000.0 for duration 3600.0 ...
./show_res --res=996775412 --readable --start \
    --end --duration --job_rest --node_list
For 4 days from now, lasting 8 hours on all configured nodes:
./create_system_res --offset=345600 --duration=28800
For 3:31AM Jan 15, 2011 TZ time, lasting 8 hours on all configured nodes:
./create_system_res --start=03:31_01/15/2011 \
    --duration=28800
On 2 nodes:
./create_system_res --offset=345600 --duration=28800 \
    --resource_amount=2
On nodes popocatzin2, popocatzin3:
./create_system_res --offset=345600 --duration=28800 \
    --node_list=popocatzin2,popocatzin3
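The --offset and --duration values in these examples are plain seconds, and --start uses the HH:MM_mm/dd/yyyy form. As a small sketch of the arithmetic (the helper names offset_seconds and catalina_time are illustrative, not Catalina functions), Python's datetime module reproduces the numbers used above:

```python
from datetime import datetime, timedelta


def offset_seconds(**parts):
    """Convert days/hours/minutes keyword arguments to whole seconds."""
    return int(timedelta(**parts).total_seconds())


def catalina_time(moment):
    """Format a datetime in the HH:MM_mm/dd/yyyy form used by --start."""
    return moment.strftime("%H:%M_%m/%d/%Y")


four_days = offset_seconds(days=4)      # value passed to --offset
eight_hours = offset_seconds(hours=8)   # value passed to --duration
start = catalina_time(datetime(2011, 1, 15, 3, 31))

print(four_days, eight_hours, start)
```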
./cancel_res --res_id=<reservation id>
- Reservations that are re-made on a regular basis
./create_standing_res --start_spec='0 14 * * *' \
    --depth=5 --duration=64800 --resource_amount=2 \
    --job_restriction="if input_tuple[0]['job_class'] \
    == 'trabajoslyn' and input_tuple[0]['wall_clock_limit'] \
    <= 900 : result = 0" --mode=real
./show_standing_res --job_rest
./update_standing_reservations
./show_res --purpose --readable --start --end \
    --job_rest | grep trabajoslyn
./cancel_standing_res --res=<standing reservation id>
- Guarantee that a fixed number of nodes will be available within a maximum configured wait time
- For important jobs, will provide assured queue wait time
./create_standing_res --start_spec='0 14 * * *' \
    --depth=5 --duration=64800 --resource_amount=2 \
    --job_restriction="if input_tuple[0]['job_class'] \
    == 'trabajoslyn' and input_tuple[0]['wall_clock_limit'] \
    <= 900 : result = 0" --mode=real \
    --latency=7200 --overlap_running=1
./update_system_priority --job=<jobid> \
    --system_priority=10
./update_system_priority --job=<jobid> --system_priority=0
./show_bf
For 'legion' user, 'normal' class, 'met200' group, 'met200' account, '6' QOS job:
./show_bf --username=legion --class=normal \
    --group=met200 --account=met200 --qos=6
./show_res --overlap --purpose --comment --readable \
    --start --end
./show_res --nodegrep --purpose --comment --readable \
    --start --end | grep popocatzin2
./stop.py
- users allowed to create reservations
- user-settable reservations are limited by policies
./user_set_res --account=<account> \
    --nodes=<number of nodes> \
    --duration=<seconds duration> \
    --earliest_start=<earliest start HH:MM_mm/dd/yyyy or epoch seconds> \
    --latest_end=<latest end HH:MM_mm/dd/yyyy or epoch seconds> \
    --email=<email address user@domain> \
    [--sharedmap=<1#type:node_shared#cpu:1+memory:1>] \
    [--featurelist=<comma-delimited list of node features>] \
    [--qoslist=<comma-delimited list of allowed QOS>]
./user_set_res --account=sds122 --nodes=8 \
    --duration=3600 --earliest_start=13:40_11/16/2010 \
    --latest_end=23:00_11/16/2010 --email=<email address>
./user_cancel_res <reservation id>
./user_bind_res <reservation id> \
    <comma-delimited list of jobs>
./user_unbind_res <reservation id> \
    <comma-delimited list of jobs>
./user_bind_job <job id> \
    <comma-delimited list of reservations>
./user_unbind_job <job id> \
    <comma-delimited list of reservations>
- Priority Calculation
- Policies
  - running
  - queued
  - soft
  - node-seconds
  - user-settable reservations
- Extensibility
single queue
priority calculation
QOS
expansion factor
fairshare adjustment
max priority adjustment
priority = resource_number * Resource_Weight
         + wall_clock_time * Wall_Time_Weight
         + QOS_priority * QOS_Priority_Weight
         + local_user_float * Local_User_Weight
         + local_admin_float * Local_Admin_Weight
         + expansion_factor * Expansion_Factor_Weight
         + queue_wait_time * System_Queue_Time_Weight
         + submit_wait_time * Submit_Time_Weight
         + QOS_target_xf_value * QOS_Target_Expansion_Factor_Weight
         + QOS_target_qwt_value * QOS_Target_Queue_Wait_Time_Weight
# each QOS has a starting priority value
QOS_PRIORITY_STRING = { '0' : 0L,
'1' : 10L,
'2' : 20L,
'3' : 30L,
'4' : 40L,
'5' : 50L,
'6' : 10000000000L,
'7' : 40L,
'8' : 40L,
'9' : 10000000000000000L,
'10' : 5L,
'11' : 10000000000L
}
Expansion Factor
(requested time + system queue time)/requested time
Target Expansion Factor
1/(target expansion factor - current expansion factor)
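A short worked example may make the two expansion factor formulas above concrete. This is a sketch under stated assumptions: the function names are illustrative, times are taken to be in seconds, and because the text does not say how a job at or past its target is handled, the sketch simply raises an error in that case.

```python
def expansion_factor(requested_seconds, queue_wait_seconds):
    """(requested time + system queue time) / requested time."""
    if requested_seconds <= 0:
        raise ValueError("requested time must be positive")
    return (requested_seconds + queue_wait_seconds) / requested_seconds


def target_xf_value(target_xf, current_xf):
    """1 / (target expansion factor - current expansion factor)."""
    gap = target_xf - current_xf
    if gap <= 0:
        # Assumption: behaviour at or beyond the target is not described.
        raise ValueError("current expansion factor has reached the target")
    return 1.0 / gap


# A 900 s job that has waited 1800 s in the queue (made-up numbers).
xf = expansion_factor(900, 1800)
value = target_xf_value(5.0, xf)
print(xf, value)
```

The value grows as the job's expansion factor closes in on the target, which is what pushes such jobs up the queue.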
- fairshare_value for each job
- Configurable:
- TOTAL_AVAILABLE
- THRESHOLD_PERCENTAGE
- PENALTY_PERCENTAGE
if fairshare_value / TOTAL_AVAILABLE > THRESHOLD_PERCENTAGE/100:
    priority = priority - (PENALTY_PERCENTAGE * priority / 100)
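The fairshare rule above can be tried out as a small function. This is a sketch, not Catalina's code: the function name and the 25 percent penalty in the example are illustrative assumptions, while 4800.0 and 60.0 echo the TOTAL_AVAILABLE and THRESHOLD_PERCENTAGE values in the sample configuration in this document.

```python
def apply_fairshare_penalty(priority, fairshare_value, total_available,
                            threshold_percentage, penalty_percentage):
    """Reduce priority when usage exceeds the configured share threshold."""
    if fairshare_value / total_available > threshold_percentage / 100:
        priority = priority - (penalty_percentage * priority / 100)
    return priority


# 3000 / 4800 = 62.5 %, above the 60 % threshold, so the penalty applies.
over = apply_fairshare_penalty(1000.0, 3000.0, 4800.0, 60.0, 25.0)
# 2400 / 4800 = 50 %, below the threshold, so the priority is unchanged.
under = apply_fairshare_penalty(1000.0, 2400.0, 4800.0, 60.0, 25.0)
print(over, under)
```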
# each QOS has a max priority
QOS_MAX_PRIORITY_STRING = { '0' : 100000000L,
'1' : 1000000000L,
'2' : 1000000000L,
'3' : 1000000000L,
'4' : 1000000000L,
'5' : 1000000000L,
'6' : 1000000000000000L,
'7' : 10000000000L,
'8' : 10000000000L,
'9' : 10000000000000000000L,
'10' : 1000000000000000L,
'11' : 100000000000000L
}
priority = 1 / (target expansion factor - current expansion factor)
priority = (priority * max_pri * 10) / (priority * 10 + max_pri)
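The second formula above keeps the result below max_pri however large the raw priority becomes. A minimal sketch (the function name and the sample numbers are illustrative assumptions):

```python
def cap_priority(priority, max_pri):
    """Apply the max priority adjustment; the result stays below max_pri."""
    return (priority * max_pri * 10) / (priority * 10 + max_pri)


mid = cap_priority(100, 1000)      # raw priority well under the cap
huge = cap_priority(10**9, 1000)   # very large raw priority
print(mid, huge)
```

For positive inputs the function is increasing in priority, so ordering between jobs of the same QOS is preserved while the cap bounds how far any one job can climb.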
FAIRSHARE_BONUS_WEIGHT=0
PENALTY_PERCENTAGE=0.0
TOTAL_AVAILABLE = 4800.0
THRESHOLD_PERCENTAGE=60.0
RESOURCE_WEIGHT=250.0
EXPANSION_FACTOR_WEIGHT=1.0
SYSTEM_QUEUE_TIME_WEIGHT=0.030
SUBMIT_TIME_WEIGHT=0.0000001
LOCAL_USER_WEIGHT=0.0
LOCAL_ADMIN_WEIGHT=0.004
WALL_TIME_WEIGHT=0.0
QOS_PRIORITY_WEIGHT = 425.0
QOS_TARGET_EXPANSION_FACTOR_WEIGHT = 1.0
QOS_TARGET_QUEUE_WAIT_TIME_WEIGHT = 1.0
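To see how these weights combine with the priority formula given earlier, the weighted sum can be sketched in a few lines. The weights are copied from the configuration above; the dictionary layout, the function name, and the sample job are assumptions made for illustration, and the units of each term (for example whether resource_number counts nodes) are not specified here.

```python
import math

# Weight for each term of the priority formula, from the sample configuration.
WEIGHTS = {
    "resource_number": 250.0,          # RESOURCE_WEIGHT
    "wall_clock_time": 0.0,            # WALL_TIME_WEIGHT
    "QOS_priority": 425.0,             # QOS_PRIORITY_WEIGHT
    "local_user_float": 0.0,           # LOCAL_USER_WEIGHT
    "local_admin_float": 0.004,        # LOCAL_ADMIN_WEIGHT
    "expansion_factor": 1.0,           # EXPANSION_FACTOR_WEIGHT
    "queue_wait_time": 0.030,          # SYSTEM_QUEUE_TIME_WEIGHT
    "submit_wait_time": 0.0000001,     # SUBMIT_TIME_WEIGHT
    "QOS_target_xf_value": 1.0,        # QOS_TARGET_EXPANSION_FACTOR_WEIGHT
    "QOS_target_qwt_value": 1.0,       # QOS_TARGET_QUEUE_WAIT_TIME_WEIGHT
}


def job_priority(terms, weights=WEIGHTS):
    """Weighted sum of the priority terms; missing terms count as zero."""
    return sum(weight * terms.get(name, 0.0)
               for name, weight in weights.items())


# Hypothetical job: 2 resources, QOS starting priority 10, waited 1800 s.
job = {
    "resource_number": 2,
    "QOS_priority": 10,
    "expansion_factor": 3.0,
    "queue_wait_time": 1800,
}
total = job_priority(job)
print(round(total, 3))
```

With these weights the QOS term (10 * 425) dominates the result, which is consistent with QOS being the main lever in this configuration.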
- running
- queued
- soft
- node-seconds
- user-settable reservations
# each QOS may have a max running jobs per user policy
QOS_MAXJOBPERUSERPOLICY_STRING = { '0' : 4,
'1' : 12,
'2' : None,
'3' : 2,
'4' : 2,
'5' : 2,
'6' : 4,
'7' : 2,
'8' : 2,
'9' : None,
'10' : 4,
'11' : 4
}
# each QOS may have a max queued jobs per user policy
QOS_MAXJOBQUEUEDPERUSERPOLICY_STRING = { '0' : 4,
'1' : 12,
'2' : None,
'3' : 4,
'4' : 4,
'5' : 4,
'6' : 4,
'7' : 4,
'8' : 4,
'9' : None,
'10' : 4,
'11' : 4
}
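A policy table like the two above can be read as a per-user limit looked up by QOS. The following sketch shows one way such a check might work. It assumes that None means no limit and that the count is taken per user within the same QOS; neither is confirmed by the text, and the job dictionaries and function name are illustrative.

```python
# Subset of the running-jobs table above, used for the example.
QOS_MAX_RUNNING = {'0': 4, '1': 12, '2': None, '3': 2}


def may_start(job, running_jobs, limits=QOS_MAX_RUNNING):
    """Return True if starting the job keeps its user within the QOS limit."""
    limit = limits.get(job["qos"])
    if limit is None:
        # Assumption: None (or a missing QOS) means the policy is not applied.
        return True
    already_running = sum(
        1 for other in running_jobs
        if other["user"] == job["user"] and other["qos"] == job["qos"]
    )
    return already_running < limit


running = [
    {"user": "legion", "qos": "3"},
    {"user": "legion", "qos": "3"},
    {"user": "testdemo", "qos": "3"},
]
blocked = may_start({"user": "legion", "qos": "3"}, running)
allowed = may_start({"user": "testdemo", "qos": "3"}, running)
unlimited = may_start({"user": "legion", "qos": "2"}, running)
print(blocked, allowed, unlimited)
```

The same function could be pointed at the queued-jobs table by passing it as limits along with the list of queued jobs.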
# each QOS may have a max running jobs per user policy
QOS_MAXJOBPERUSERPOLICY_STRING = { '0' : None,
'1' : None,
'2' : None,
'3' : None,
'4' : None,
'5' : None,
'6' : None,
'7' : None,
'8' : None,
'9' : None,
'10' : None,
'11' : None
}
# each QOS may have a max queued node-seconds per account policy
QOS_MAXNODESECQUEUEDPERACCOUNTPOLICY_STRING = { '0' : None,
'1' : None,
'2' : None,
'3' : None,
'4' : None,
'5' : None,
'6' : None,
'7' : None,
'8' : None,
'9' : None,
'10' : None,
'11' : 2764800
}
USER_SET_LIMITS_DICT_STRING = {
'DEFAULT' : { 'instances_int' : 2,
'nodes_int' : 8,
'seconds_int' : 3600
},
'sys200' : { 'instances_int' : -1,
'nodes_int' : -1,
'seconds_int' : -1
},
'GLOBAL' : { 'window' : 14400,
'ABSOLUTE_LIMIT' : 28800,
'REQUIREDFEATURESLIST' : ['exclusive',] }
}
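One way to read the limits table above is as a per-account lookup that falls back to DEFAULT. The sketch below rests on several assumptions that the text does not confirm: that keys such as 'sys200' are accounts, that -1 means unlimited, and that limits are inclusive. The GLOBAL entries are left out because their semantics are not described here, and the function names are illustrative.

```python
USER_SET_LIMITS = {
    'DEFAULT': {'instances_int': 2, 'nodes_int': 8, 'seconds_int': 3600},
    'sys200': {'instances_int': -1, 'nodes_int': -1, 'seconds_int': -1},
}


def within(limit, value):
    """Assumption: -1 means unlimited, otherwise the limit is inclusive."""
    return limit == -1 or value <= limit


def request_allowed(account, nodes, seconds, existing_instances,
                    table=USER_SET_LIMITS):
    """Check a reservation request against the account's limits."""
    limits = table.get(account, table['DEFAULT'])
    return (within(limits['nodes_int'], nodes)
            and within(limits['seconds_int'], seconds)
            and within(limits['instances_int'], existing_instances + 1))


ok = request_allowed('sds122', nodes=8, seconds=3600, existing_instances=0)
too_big = request_allowed('sds122', nodes=9, seconds=3600, existing_instances=0)
too_many = request_allowed('sds122', nodes=8, seconds=3600, existing_instances=2)
admin = request_allowed('sys200', nodes=500, seconds=10**6, existing_instances=50)
print(ok, too_big, too_many, admin)
```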
- Node selection
- Job filtering
- Conflict policy
Python code, either on the command line or in a file, to be used to filter nodes for allocation to the reservation. input_tuple[0] is the node under consideration. Set 'result = 0' if the node is approved for allocation within the reservation. By default, nodes in Down, Drain, Drained, None, Unknown, or with Max_Starters == 0 are rejected. All other nodes are accepted.
create_res [--node_restriction=<python code> | --node_restriction_file=<filename>]
create_res --node_restriction="if input_tuple[0]['State'] in ['Idle','Running']: result=0"
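The default node filter described above can be written out as a small function, which may help when deciding what a custom --node_restriction needs to add. This is a sketch of the stated rule, not Catalina's implementation: the function name is illustrative, and treating a missing State the same as the listed 'None' state is an assumption.

```python
REJECTED_STATES = {"Down", "Drain", "Drained", "None", "Unknown"}


def default_node_result(node):
    """Return 0 if the node is accepted, 1 if it is rejected."""
    state = node.get("State")
    if state is None or state in REJECTED_STATES:
        return 1
    if node.get("Max_Starters") == 0:
        return 1
    return 0


# Hypothetical node records for illustration.
idle = default_node_result({"State": "Idle", "Max_Starters": 8})
down = default_node_result({"State": "Down", "Max_Starters": 8})
closed = default_node_result({"State": "Idle", "Max_Starters": 0})
print(idle, down, closed)
```

The example restriction above is stricter than this default, since it accepts only nodes whose State is Idle or Running.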
Python code, either on the command line or in a file, to be used to filter jobs for running within the reservation. input_tuple[0] is the job under consideration. Set 'result = 0' if the job is approved to run within the reservation. By default, 'result = 1', so no job may run in the reservation.
create_res [--job_restriction=<python code> | --job_restriction_file=<filename>]
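To test a restriction before attaching it to a reservation, the snippet can be run against a sample job. The sketch below shows one plausible way such a snippet could be evaluated, using Python's built-in exec with input_tuple and a default result of 1. How Catalina itself evaluates the code is not shown in this document, so the mechanism and the function name are assumptions. The restriction string is the one used in the standing reservation example, joined onto one line.

```python
RESTRICTION = ("if input_tuple[0]['job_class'] == 'trabajoslyn' "
               "and input_tuple[0]['wall_clock_limit'] <= 900 : result = 0")


def evaluate_restriction(code, subject, default_result=1):
    """Run restriction code against one job or node and return 'result'."""
    namespace = {"input_tuple": (subject,), "result": default_result}
    # exec runs arbitrary code: only use with trusted administrator input.
    exec(code, namespace)
    return namespace["result"]


# Hypothetical job records for illustration.
short_job = {"job_class": "trabajoslyn", "wall_clock_limit": 600}
long_job = {"job_class": "trabajoslyn", "wall_clock_limit": 1800}
other_job = {"job_class": "normal", "wall_clock_limit": 600}

results = [evaluate_restriction(RESTRICTION, job)
           for job in (short_job, long_job, other_job)]
print(results)
```

Only the first job is approved; the other two keep the default result of 1 and stay outside the reservation.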
Python code, either on the command line or in a file, to be used to return open time windows for each node. input_tuple[0] is a list of accepted nodes. input_tuple[1] is the new reservation, with attributes 'earliest_start_float' (epoch time) and 'duration_float' (seconds). input_tuple[2] is the list of existing reservations. The code returns a dictionary keyed by node name; each value is a list of tuples of the form (float epoch start of window, float epoch end of window, node name). By default, only time windows that do not conflict with the existing reservations are returned.
create_res [--conflict_policy=<python code> | --conflict_policy_file=<filename>]
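The default behaviour described above, returning only windows that avoid existing reservations, can be sketched as follows. Much of this is assumed for illustration: nodes are represented by their names, reservations are dictionaries, the existing-reservation keys 'node_list', 'start_time_float' and 'end_time_float' are hypothetical, and a horizon parameter stands in for whatever end of time the scheduler uses. Only 'earliest_start_float', 'duration_float', and the shape of the returned dictionary come from the description.

```python
def open_windows(input_tuple, horizon=float("inf")):
    """Return {node name: [(window start, window end, node name), ...]}."""
    nodes, new_res, existing = input_tuple
    earliest = new_res["earliest_start_float"]
    duration = new_res["duration_float"]
    windows = {}
    for node in nodes:
        # Busy intervals on this node, ordered by start time.
        busy = sorted(
            (res["start_time_float"], res["end_time_float"])
            for res in existing if node in res["node_list"]
        )
        free = []
        cursor = earliest
        for start, end in busy:
            if end <= cursor:
                continue  # reservation ends before the point of interest
            if start > cursor and start - cursor >= duration:
                free.append((cursor, start, node))
            cursor = max(cursor, end)
        if horizon - cursor >= duration:
            free.append((cursor, horizon, node))
        windows[node] = free
    return windows


# Hypothetical data: n1 is reserved from t=120 to t=200, n2 is free.
new_reservation = {"earliest_start_float": 100.0, "duration_float": 50.0}
existing_reservations = [
    {"node_list": ["n1"], "start_time_float": 120.0, "end_time_float": 200.0},
]
result = open_windows((["n1", "n2"], new_reservation, existing_reservations),
                      horizon=1000.0)
print(result)
```

On n1 the 20-second gap before the existing reservation is too short for a 50-second request, so its only window starts when that reservation ends.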
schedulerqueen@gmail.com kenneth@sdsc.edu
http://users.sdsc.edu/~kenneth/ipn.2010/workshop/slideshow.html