Here is a snapshot of SDSC Gordon, recorded at regular intervals by my nodeview program.

System Overview of Gordon

Total Nodes          896
Total Cores          14816
Total Jobs           356
Total Ranks          13028
Total Load           12173.9
Total SUs Running    686818
Total SUs Queued     447619

                     Current    Max       %
Node Availability    884        900       98.2%
CPU Utilization      12173.9    14816     82.2%
Core Utilization     13028      14816     87.9%
Slot Utilization     13072      14880     87.8%
Avail Slot Util      13072      14624     89.4%
Mem Utilization      4.3TB      58.6TB    7.3%
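
The percentage column above appears to be the Current value divided by the Max value. The following is a minimal sketch of that arithmetic using the numbers from the table; the function name utilization_percent is hypothetical and is not taken from nodeview.

```python
def utilization_percent(current: float, maximum: float) -> float:
    """Return current/maximum as a percentage, rounded to one decimal place."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return round(100.0 * current / maximum, 1)


# (current, max) pairs copied from the overview table above.
rows = {
    "Node Availability": (884, 900),
    "CPU Utilization": (12173.9, 14816),
    "Core Utilization": (13028, 14816),
    "Slot Utilization": (13072, 14880),
    "Avail Slot Util": (13072, 14624),
    "Mem Utilization": (4.3, 58.6),  # both values in TB
}

for name, (current, maximum) in rows.items():
    print(f"{name:<18} {utilization_percent(current, maximum):5.1f}%")
```

Each computed value matches the percentage reported in the table, which suggests the columns are related this way.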

Visual Node Status of Gordon

System Utilization

Current as of Thursday, July 17, 2014 at 7:00 AM

Shades of blue indicate each node's CPU load (darker = higher). Red nodes are down or offline, and yellow nodes are overloaded (their load is significantly higher than the number of available CPUs).
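
The color rules above can be sketched as a small classifier. This is only an illustration: the function names, the state strings, and the overload_factor threshold of 1.5 are all assumptions, since the text does not say how "significantly higher" is defined.

```python
def node_color(state: str, load: float, cpus: int,
               overload_factor: float = 1.5) -> str:
    """Map a node's state and load to a display color.

    overload_factor is a hypothetical threshold for "significantly
    higher than the number of available CPUs".
    """
    if state in ("down", "offline"):
        return "red"
    if load > overload_factor * cpus:
        return "yellow"
    return "blue"


def blue_shade(load: float, cpus: int) -> float:
    """Return a shade intensity in [0, 1]; higher load gives a darker blue."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return max(0.0, min(load / cpus, 1.0))


print(node_color("up", 15.8, 16), blue_shade(15.8, 16))
print(node_color("up", 40.0, 16))
print(node_color("offline", 0.0, 16))
```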

Availability and Utilization over Time

The top figure below shows the utilization and availability of various resources. The bottom figure shows the capacity both running on the system and waiting in the queue. This capacity is measured in CPU core-hours (SUs) and is calculated from the requested time of every running and queued job. The figures are generated using a few R scripts located in my GitHub repository.
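
As a rough sketch of the capacity calculation described above, the SUs for a job can be taken as its core count multiplied by its requested wall time in hours, summed separately for running and queued jobs. The Job class and its fields are hypothetical, and the real R scripts may compute this differently.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Job:
    cores: int              # CPU cores requested by the job
    requested_hours: float  # requested wall time, in hours
    running: bool           # True if running, False if waiting in queue


def capacity_sus(jobs: Iterable[Job]) -> Tuple[float, float]:
    """Return (SUs running, SUs queued) as core-hours of requested time."""
    running = 0.0
    queued = 0.0
    for job in jobs:
        sus = job.cores * job.requested_hours
        if job.running:
            running += sus
        else:
            queued += sus
    return running, queued


# Made-up jobs for illustration only; not real Gordon data.
example = [Job(16, 2.0, True), Job(32, 1.5, False), Job(16, 0.5, True)]
print(capacity_sus(example))
```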

[Figure: utilization and availability of resources over time]

[Figure: capacity running and waiting in queue, in SUs]

Known Events

The following events highlight abnormal features in the above availability, utilization, and queue health data.

Date             Event
June 6, 2015     PM (with drain) to reboot Monkey MDS
June 18, 2015    cipres reservation released
June 24, 2015    networking issue (due to Arista upgrade?)

Current Utilization Breakdown

[Figure: Capacity Running]

[Figure: Capacity Waiting]

[Figure: Node Utilization]

[Figure: Core Utilization]