Catalina Scheduler

Outline

SLURM hold jobs

Submit a job

Run a job

External Scheduler

Catalina Install

Catalina Commands

SLURM

We want to disable the SLURM internal scheduler. We do this by configuring the Wiki interface of SLURM.

edit /etc/slurm-llnl/slurm.conf

edit /etc/slurm-llnl/wiki.conf

restart slurmctld

slurm.conf

...
SchedulerType=sched/wiki
SchedulerPort=7321
...

wiki.conf

AuthKey=<key>

SLURM Jobs Held

With the Wiki interface active, submitted jobs go into JobHeld (Priority=0). Now, we can use SLURM commands to start jobs.

SLURM Job

#!/bin/sh
#SBATCH --time=10
#SBATCH --nodes=2
#SBATCH --tasks-per-node=8
/bin/hostname
/usr/local/bin/srun -l /bin/hostname
/usr/local/bin/srun -l /bin/pwd
/home/kenneth/info/openmpi/install/bin/mpirun \
/home/kenneth/testprog/ring26 -t 10 -n 3 -l 100 \
-i 0.03125 -c 0 -s 0
/bin/sleep 900

Submit Job

/usr/local/bin/sbatch -o out_%j_%t r.script

Held Jobs

squeue shows jobs in JobHeld:

http://users.sdsc.edu/~kenneth/ipn.2010/workshop/held.txt

Run Job

We can use the scontrol command to:

specify which nodes are used by the job

specify a non-zero Priority to tell SLURM to run the job

/usr/local/bin/scontrol update JobId=<job id> \
ReqNodeList=<nodelist> Priority=1

Query Job

Here's output from issuing the scontrol update command to a held job:

http://users.sdsc.edu/~kenneth/ipn.2010/workshop/simple1.txt

Simple Scheduler

Instead of doing this by hand, we can write a script to do it for us. In this case, a Python script:

http://users.sdsc.edu/~kenneth/ipn.2010/workshop/sched.py
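The linked script is not reproduced here. As a minimal sketch of the idea (not the workshop's sched.py), the Python below builds the scontrol update command shown earlier for a held job. The function name, the sample job id, and the node names are hypothetical, and the command is only constructed, not executed.

```python
def build_release_command(job_id, nodes, priority=1,
                          scontrol="/usr/local/bin/scontrol"):
    """Return the argv list that releases a held job onto the given nodes."""
    if priority < 1:
        # Priority=0 is what keeps the job in JobHeld.
        raise ValueError("priority must be at least 1 to run the job")
    return [
        scontrol, "update",
        "JobId=%s" % job_id,
        "ReqNodeList=%s" % ",".join(nodes),
        "Priority=%d" % priority,
    ]

# Hypothetical job id and node names, for illustration only.
cmd = build_release_command(42, ["node1", "node2"])
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.call to actually issue the command.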

Catalina Download

Download the latest tarball from http://www.sdsc.edu/catalina

Catalina Unpack

Unpack the software into a build directory:

tar zxf catalina.tar.gz
cd catalina

Set Env Variables

http://users.sdsc.edu/~kenneth/ipn.2010/workshop/env.txt

Configure Makefile

./conf.sh

Build and install

make
make install

catalina.config

make config

Copy the catalina.config file to the install directory

Initialize database files

Go to the Catalina home directory:

cd <installdir>

./initialize_dbs

Starting/Stopping Catalina

Start the scheduler:

./start.ksh

Alternatively, run the scheduler in the foreground:

./set_config --key_value=state:running
./catalina_schedule_jobs --iterate

This prints to STDOUT and stays in the foreground.

Catalina Queries

Display the queue:

./show_q | more

Show reservations:

./show_res | more

Create a reservation

./set_res --user_list=testdemo,diegella \
--start=00:00_12/25/2010 --end=01:00_01/25/2011 \
--resource_amount=3 --mode=real

Add --feature_list=CPUs8,MEM32 to restrict the reservation to nodes with those features.

Note the res_id given for the reservation:

...
reservation 996775412 created on 32 nodes with start_time
1001376000.0 for duration 3600.0
...

Check the reservation:

./show_res --res=996775412 --readable --start \
--end --duration --job_rest --node_list

Create a system reservation

For 4 days from now, lasting 8 hours on all configured nodes:

./create_system_res --offset=345600 --duration=28800
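The --offset and --duration values are plain seconds. A small sketch of the arithmetic behind "4 days from now, lasting 8 hours", using only the standard library:

```python
from datetime import timedelta

# 4 days and 8 hours, expressed in seconds.
offset = int(timedelta(days=4).total_seconds())
duration = int(timedelta(hours=8).total_seconds())

print("./create_system_res --offset=%d --duration=%d" % (offset, duration))
```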

Create a system reservation

Starting at 3:31 AM on Jan 15, 2011, local (TZ) time, lasting 8 hours on all configured nodes:

./create_system_res --start=03:31_01/15/2011 \
--duration=28800
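The --start value follows the HH:MM_mm/dd/yyyy pattern used throughout these commands. A minimal sketch that produces such a string from a Python datetime; the helper name catalina_time is hypothetical:

```python
from datetime import datetime

def catalina_time(dt):
    """Format a datetime as HH:MM_mm/dd/yyyy."""
    return dt.strftime("%H:%M_%m/%d/%Y")

start = catalina_time(datetime(2011, 1, 15, 3, 31))
print("./create_system_res --start=%s --duration=28800" % start)
```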

System reservation

On 2 nodes:

./create_system_res --offset=345600 --duration=28800 \
--resource_amount=2

System reservation

On nodes popocatzin2, popocatzin3:

./create_system_res --offset=345600 --duration=28800 \
--node_list=popocatzin2,popocatzin3

Cancel a reservation

./cancel_res --res_id=<reservation id>

Standing reservations

Reservations that are re-made on a regular basis

Standing reservations policy

Create the Interactive Standing reservation

./create_standing_res --start_spec='0 14 * * *' \
--depth=5 --duration=64800 --resource_amount=2 \
--job_restriction="if input_tuple[0]['job_class'] \
== 'trabajoslyn' and input_tuple[0]['wall_clock_limit'] \
<= 900 : result = 0" --mode=real
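To see what the --job_restriction snippet decides, it can be tried out in plain Python. This is a sketch under assumptions: how Catalina itself evaluates the snippet is not shown in these slides, so the exec call, the helper name job_allowed, and the sample job dictionaries are illustrative only. The starting value result = 1 follows the default described under Job filtering.

```python
RESTRICTION = (
    "if input_tuple[0]['job_class'] == 'trabajoslyn' "
    "and input_tuple[0]['wall_clock_limit'] <= 900 : result = 0"
)

def job_allowed(job, restriction=RESTRICTION):
    """Run the restriction snippet against one job; result == 0 means approved."""
    namespace = {"input_tuple": (job,), "result": 1}  # default: rejected
    exec(restriction, namespace)
    return namespace["result"] == 0

# Hypothetical jobs, for illustration only.
short_job = {"job_class": "trabajoslyn", "wall_clock_limit": 600}
long_job = {"job_class": "trabajoslyn", "wall_clock_limit": 3600}
other_job = {"job_class": "normal", "wall_clock_limit": 600}

for job in (short_job, long_job, other_job):
    print(job, job_allowed(job))
```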

Check interactive standing reservation

./show_standing_res --job_rest

Create Standing res instances

./update_standing_reservations

Check for standing reservation instances:

./show_res --purpose --readable --start --end \
--job_rest | grep trabajoslyn

Delete a standing reservation

./cancel_standing_res --res=<standing reservation id>

Shortpool reservations

Guarantee that a fixed number of nodes will be available within a maximum configured wait time

Provides an assured maximum queue wait time for important jobs

Create a Shortpool Standing reservation

./create_standing_res --start_spec='0 14 * * *' \
--depth=5 --duration=64800 --resource_amount=2 \
--job_restriction="if input_tuple[0]['job_class'] \
== 'trabajoslyn' and input_tuple[0]['wall_clock_limit'] \
<= 900 : result = 0" --mode=real \
--latency=7200 --overlap_running=1

How does shortpool work?

Set system priority for a job

./update_system_priority --job=<jobid> \
--system_priority=10

Unset system priority for a job

./update_system_priority --job=<jobid> --system_priority=0

Show backfill windows

./show_bf

Show backfill windows

For 'legion' user, 'normal' class, 'met200' group, 'met200' account, '6' QOS job:

./show_bf --username=legion --class=normal \
--group=met200 --account=met200 --qos=6

Show reservation overlaps

./show_res --overlap --purpose --comment --readable \
--start --end

Show all reservations on popocatzin2

./show_res --nodegrep --purpose --comment --readable \
--start --end | grep popocatzin2

Stop scheduling

./stop.py

User-settable Reservations

users allowed to create reservations

user-settable reservations are limited by policies

./user_set_res --account=<account> \
--nodes=<number of nodes> \
--duration=<seconds duration> \
--earliest_start=<earliest start HH:MM_mm/dd/yyyy or \
epoch seconds> \
--latest_end=<latest end HH:MM_mm/dd/yyyy or \
epoch seconds> \
--email=<email address user@domain> \
[--sharedmap=<1#type:node_shared#cpu:1+memory:1>] \
[--featurelist=<comma-delimited list of node features>] \
[--qoslist=<comma-delimited list of allowed QOS>]

Create a user reservation:

./user_set_res --account=sds122 --nodes=8 \
--duration=3600 --earliest_start=13:40_11/16/2010 \
--latest_end=23:00_11/16/2010 --email=<email address>

Cancel own user reservation:

./user_cancel_res <reservation id>

Bind the reservation to a job:

./user_bind_res <reservation id> \
<comma-delimited list of jobs>

Unbind the reservation from a job:

./user_unbind_res <reservation id> \
<comma-delimited list of jobs>

Bind the job to reservations:

./user_bind_job <job id> \
<comma-delimited list of reservations>

Unbind the job from reservations:

./user_unbind_job <job id> \
<comma-delimited list of reservations>

Catalina Configuration

Priority Calculation

Policies: running, queued, soft, node-seconds, user-settable reservations

Extensibility

Job Queue Priority

single queue

priority calculation

QOS

expansion factor

fairshare adjustment

max priority adjustment

Priority Calculation

priority =
resource_number      * Resource_Weight
+ wall_clock_time    * Wall_Time_Weight
+ QOS_priority       * QOS_Priority_Weight
+ local_user_float   * Local_User_Weight
+ local_admin_float  * Local_Admin_Weight
+ expansion_factor   * Expansion_Factor_Weight
+ queue_wait_time    * System_Queue_Time_Weight
+ submit_wait_time   * Submit_Time_Weight
+ QOS_target_xf_value
  * QOS_Target_Expansion_Factor_Weight
+ QOS_target_qwt_value
  * QOS_Target_Queue_Wait_Time_Weight
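The formula is a weighted sum, which a few lines of Python make concrete. This is a sketch, not Catalina's code: the function name and the sample job values are hypothetical, while the weights are copied from the Catalina Priority weights list later in these slides.

```python
# Term name -> weight, taken from the "Catalina Priority weights" slide.
WEIGHTS = {
    "resource_number": 250.0,        # RESOURCE_WEIGHT
    "wall_clock_time": 0.0,          # WALL_TIME_WEIGHT
    "QOS_priority": 425.0,           # QOS_PRIORITY_WEIGHT
    "local_user_float": 0.0,         # LOCAL_USER_WEIGHT
    "local_admin_float": 0.004,      # LOCAL_ADMIN_WEIGHT
    "expansion_factor": 1.0,         # EXPANSION_FACTOR_WEIGHT
    "queue_wait_time": 0.030,        # SYSTEM_QUEUE_TIME_WEIGHT
    "submit_wait_time": 0.0000001,   # SUBMIT_TIME_WEIGHT
    "QOS_target_xf_value": 1.0,      # QOS_TARGET_EXPANSION_FACTOR_WEIGHT
    "QOS_target_qwt_value": 1.0,     # QOS_TARGET_QUEUE_WAIT_TIME_WEIGHT
}

def weighted_priority(job, weights=WEIGHTS):
    """Sum each job term times its weight; missing terms count as 0."""
    return sum(job.get(name, 0.0) * w for name, w in weights.items())

# Hypothetical job values, for illustration only.
job = {"resource_number": 2, "QOS_priority": 10,
       "expansion_factor": 1.5, "queue_wait_time": 1000}
print(weighted_priority(job))
```

With these weights the resource and QOS terms dominate: 2 * 250 + 10 * 425 accounts for 4750 of the total.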

Query Priority

./query_priority

query output

QOS Priority

# each QOS has a starting priority value
QOS_PRIORITY_STRING = { '0' : 0L,
               '1' : 10L,
               '2' : 20L,
               '3' : 30L,
               '4' : 40L,
               '5' : 50L,
               '6' : 10000000000L,
               '7' : 40L,
               '8' : 40L,
               '9' : 10000000000000000L,
              '10' : 5L,
              '11' : 10000000000L
              }

Expansion factor

Expansion Factor

(requested time + system queue time)/requested time

Target Expansion Factor

1/(target expansion factor - current expansion factor)
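Both expressions are easy to check numerically. A minimal sketch with hypothetical function names and sample values; the slides do not say how the case where the current expansion factor equals the target is handled, so the sketch simply lets the division fail there.

```python
def expansion_factor(requested_time, system_queue_time):
    """(requested time + system queue time) / requested time."""
    return (requested_time + system_queue_time) / requested_time

def target_xf_value(target_xf, current_xf):
    """1 / (target expansion factor - current expansion factor)."""
    return 1.0 / (target_xf - current_xf)

# Hypothetical job: 3600 s requested, 1800 s spent in the queue so far.
current = expansion_factor(3600.0, 1800.0)
print(current, target_xf_value(2.0, current))
```

The value grows as the current expansion factor approaches the target, so a job close to missing its target contributes a larger term.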

Target Expansion Factor (example)

target xf

Fairshare interface

fairshare_value for each job

Configurable:

TOTAL_AVAILABLE

THRESHOLD_PERCENTAGE

PENALTY_PERCENTAGE

Fairshare interface (calculations)

if fairshare_value / TOTAL_AVAILABLE > THRESHOLD_PERCENTAGE / 100:
    priority = priority - (PENALTY_PERCENTAGE * priority / 100)
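The rule above can be written as a small function. In this sketch the function name is hypothetical, and the penalty of 10 percent is an illustrative value (the weights list in these slides sets PENALTY_PERCENTAGE to 0.0, which disables the penalty); TOTAL_AVAILABLE = 4800.0 and THRESHOLD_PERCENTAGE = 60.0 are taken from that list.

```python
def apply_fairshare(priority, fairshare_value, total_available,
                    threshold_percentage, penalty_percentage):
    """Reduce priority by penalty_percentage once usage passes the threshold."""
    if fairshare_value / total_available > threshold_percentage / 100:
        priority = priority - (penalty_percentage * priority / 100)
    return priority

# 3000 / 4800 = 62.5% usage, above the 60% threshold: the penalty applies.
over = apply_fairshare(900, 3000.0, 4800.0, 60.0, 10.0)
# 2000 / 4800 is about 41.7% usage: priority is unchanged.
under = apply_fairshare(900, 2000.0, 4800.0, 60.0, 10.0)
print(over, under)
```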

Fairshare (example)

priority = 900

fairshare

QOS Max Priority

# each QOS has a max priority
QOS_MAX_PRIORITY_STRING = { '0' : 100000000L,
               '1' : 1000000000L,
               '2' : 1000000000L,
               '3' : 1000000000L,
               '4' : 1000000000L,
               '5' : 1000000000L,
               '6' : 1000000000000000L,
               '7' : 10000000000L,
               '8' : 10000000000L,
               '9' : 10000000000000000000L,
              '10' : 1000000000000000L,
              '11' : 100000000000000L
              }

Maximum priority adjustment

priority = 1/
      (target expansion factor - current expansion factor)

priority =  (priority * max_pri * 10)/
            (priority * 10 + max_pri)
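The second expression keeps the adjusted priority below max_pri however large the input grows. A minimal sketch of that behaviour, with a hypothetical function name and sample values:

```python
def cap_priority(priority, max_pri):
    """(priority * max_pri * 10) / (priority * 10 + max_pri)."""
    return (priority * max_pri * 10) / (priority * 10 + max_pri)

# Hypothetical max_pri of 1000: the result approaches, but never reaches, 1000.
for p in (1, 100, 10000, 10 ** 12):
    print(p, cap_priority(p, 1000))
```

Algebraically the result equals max_pri * (10p / (10p + max_pri)), which is why it saturates at max_pri for positive priorities.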

Maximum priority adjustment (example)

max priority

Catalina Priority weights

FAIRSHARE_BONUS_WEIGHT=0
PENALTY_PERCENTAGE=0.0
TOTAL_AVAILABLE = 4800.0
THRESHOLD_PERCENTAGE=60.0
RESOURCE_WEIGHT=250.0
EXPANSION_FACTOR_WEIGHT=1.0
SYSTEM_QUEUE_TIME_WEIGHT=0.030
SUBMIT_TIME_WEIGHT=0.0000001
LOCAL_USER_WEIGHT=0.0
LOCAL_ADMIN_WEIGHT=0.004
WALL_TIME_WEIGHT=0.0
QOS_PRIORITY_WEIGHT = 425.0
QOS_TARGET_EXPANSION_FACTOR_WEIGHT = 1.0
QOS_TARGET_QUEUE_WAIT_TIME_WEIGHT = 1.0

Policies

running

queued

soft

node-seconds

user-settable reservations

Policies (running)

# each QOS may have a max running jobs per user policy
QOS_MAXJOBPERUSERPOLICY_STRING = { '0' : 4,
               '1' : 12,
               '2' : None,
               '3' : 2,
               '4' : 2,
               '5' : 2,
               '6' : 4,
               '7' : 2,
               '8' : 2,
               '9' : None,
              '10' : 4,
              '11' : 4
              }
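A sketch of how such a table could be consulted. It rests on two assumptions that these slides do not confirm: that None means the QOS has no limit, and that a job may start while the user's running-job count is below the limit. The function name is hypothetical and only a few QOS entries are copied.

```python
# Subset of the table above; None is ASSUMED to mean "no limit".
QOS_MAXJOBPERUSERPOLICY = {'0': 4, '1': 12, '2': None, '3': 2}

def may_start(qos, running_jobs_for_user, policy=QOS_MAXJOBPERUSERPOLICY):
    """True if starting one more job keeps the user within the QOS limit."""
    limit = policy.get(qos)
    if limit is None:
        return True
    return running_jobs_for_user < limit

print(may_start('3', 1), may_start('3', 2), may_start('2', 500))
```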

Policies (queued)

# each QOS may have a max queued jobs per user policy
QOS_MAXJOBQUEUEDPERUSERPOLICY_STRING = { '0' : 4,
               '1' : 12,
               '2' : None,
               '3' : 4,
               '4' : 4,
               '5' : 4,
               '6' : 4,
               '7' : 4,
               '8' : 4,
               '9' : None,
              '10' : 4,
              '11' : 4
              }

Policies (soft)

# each QOS may have a max running jobs per user policy
QOS_MAXJOBPERUSERPOLICY_STRING = { '0' : None,
               '1' : None,
               '2' : None,
               '3' : None,
               '4' : None,
               '5' : None,
               '6' : None,
               '7' : None,
               '8' : None,
               '9' : None,
              '10' : None,
              '11' : None
              }

Policies (node-seconds)

# each QOS may have a max queued node-seconds per account policy
QOS_MAXNODESECQUEUEDPERACCOUNTPOLICY_STRING =
             { '0' : None,
               '1' : None,
               '2' : None,
               '3' : None,
               '4' : None,
               '5' : None,
               '6' : None,
               '7' : None,
               '8' : None,
               '9' : None,
              '10' : None,
              '11' : 2764800
              }
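For scale, the limit on QOS 11 can be converted to more familiar units. The sketch assumes that a job's node-seconds are its node count multiplied by its wall-clock seconds; the function name is hypothetical.

```python
def node_seconds(nodes, wall_clock_seconds):
    """ASSUMED definition: nodes multiplied by wall-clock seconds."""
    return nodes * wall_clock_seconds

LIMIT = 2764800  # QOS '11' value from the table above

# The limit expressed in node-days (86400 seconds per day).
print(LIMIT / 86400.0)
# One job on 32 nodes for 24 hours uses up the whole allowance.
print(node_seconds(32, 24 * 3600) == LIMIT)
```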

User reservation policies

USER_SET_LIMITS_DICT_STRING = {
  'DEFAULT' : { 'instances_int' : 2,
                   'nodes_int' : 8,
                 'seconds_int' : 3600
             },
  'sys200' : { 'instances_int' : -1,
                   'nodes_int' : -1,
                 'seconds_int' : -1
             },
  'GLOBAL' : { 'window' : 14400,
       'ABSOLUTE_LIMIT' : 28800,
       'REQUIREDFEATURESLIST' : ['exclusive',] }
  }

Global window policy

Queue Disruption

Extensibility

Node selection

Job filtering

Conflict policy

Node selection

Python code, either on the command line or in a file, to be used to filter nodes for allocation to the reservation. input_tuple[0] is the node under consideration. Set 'result = 0' if the node is approved for allocation within the reservation. By default, nodes in Down, Drain, Drained, None, Unknown, or with Max_Starters == 0 are rejected. All other nodes are accepted.

Node selection (usage)

create_res [--node_restriction=<python code> | --node_restriction_file=<filename>]

Node selection (example)

create_res --node_restriction="if input_tuple[0]['State'] in ['Idle','Running']: result=0"
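The example restriction can be tried against a list of nodes in plain Python. This is a sketch under assumptions: the use of exec, the starting value result = 1, the 'name' key, and the node popocatzin4 are illustrative and not taken from Catalina.

```python
NODE_RESTRICTION = "if input_tuple[0]['State'] in ['Idle','Running']: result=0"

def select_nodes(nodes, restriction=NODE_RESTRICTION):
    """Return the names of nodes for which the snippet sets result = 0."""
    selected = []
    for node in nodes:
        namespace = {"input_tuple": (node,), "result": 1}  # assumed start value
        exec(restriction, namespace)
        if namespace["result"] == 0:
            selected.append(node["name"])
    return selected

# Hypothetical node records, for illustration only.
nodes = [
    {"name": "popocatzin2", "State": "Idle"},
    {"name": "popocatzin3", "State": "Down"},
    {"name": "popocatzin4", "State": "Running"},
]
print(select_nodes(nodes))
```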

Job filtering

Python code, either on the command line or in a file, to be used to filter jobs for running within the reservation. input_tuple[0] is the job under consideration. Set 'result = 0' if the job is approved to run within the reservation. By default, 'result = 1', so no job may run in the reservation.

Job filtering (usage)

create_res [--job_restriction=<python code> | --job_restriction_file=<filename>]

Conflict policy

Python code, either on the command line or in a file, to be used to return open time windows for each node. input_tuple[0] is a list of accepted nodes. input_tuple[1] is the new reservation, with attributes 'earliest_start_float' (epoch time) and 'duration_float' (seconds). input_tuple[2] is the list of existing reservations. Return a dictionary whose keys are node names and whose values are lists of tuples of the form (float epoch start of window, float epoch end of window, node name). By default, only time windows that do not conflict with the existing reservations are returned.

Conflict policy (usage)

create_res [--conflict_policy=<python code> | --conflict_policy_file=<filename>]
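A sketch of what a non-conflicting window computation might look like, returning the dictionary shape described above. It is not Catalina's default policy: the keys 'node_list', 'start_time_float', and 'end_time_float' on existing reservations, the use of node names as plain strings, and the horizon parameter are all assumptions made for illustration.

```python
def open_windows(nodes, new_res, existing, horizon=float("inf")):
    """Map each node name to a list of (start, end, node) free windows."""
    earliest = new_res["earliest_start_float"]
    duration = new_res["duration_float"]
    windows = {}
    for node in nodes:
        # Busy intervals on this node, ordered by start time.
        busy = sorted(
            (r["start_time_float"], r["end_time_float"])
            for r in existing if node in r["node_list"]
        )
        free = []
        cursor = earliest
        for start, end in busy:
            if end <= cursor:
                continue  # reservation ends before the point of interest
            if start > cursor and start - cursor >= duration:
                free.append((cursor, start, node))
            cursor = max(cursor, end)
        if horizon - cursor >= duration:
            free.append((cursor, horizon, node))
        windows[node] = free
    return windows

# Hypothetical data: n1 has two existing reservations, n2 has none.
new_res = {"earliest_start_float": 0.0, "duration_float": 100.0}
existing = [
    {"node_list": ["n1"], "start_time_float": 50.0, "end_time_float": 200.0},
    {"node_list": ["n1"], "start_time_float": 400.0, "end_time_float": 500.0},
]
result = open_windows(["n1", "n2"], new_res, existing, horizon=1000.0)
print(result)
```

Only gaps at least as long as the requested duration are kept, which is why the 50-second gap before the first reservation on n1 is dropped.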

Debug

Debug job start failures

Contact

schedulerqueen@gmail.com kenneth@sdsc.edu

http://users.sdsc.edu/~kenneth/ipn.2010/workshop/slideshow.html

http://www.sdsc.edu/catalina