domino_admin_toolkit.checks.test_node_oom_kills module

pydantic model domino_admin_toolkit.checks.test_node_oom_kills.NodeOOMKillAnalyzer

Bases: AnalyzerBase

Detects kernel-level OOM kills on cluster nodes.

Fields:
field oom_kill_threshold: int = 0
analyze(data)

Evaluate a single node’s OOM kill count against the configured threshold.

Return type:

list[CheckResult]

name: ClassVar[str] = 'NodeOOMKillAnalyzer'
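The threshold comparison that `analyze` performs can be sketched as follows. This is a hedged, self-contained illustration: `AnalyzerBase` and `CheckResult` are toolkit types, so `SimpleCheckResult` below is a hypothetical stand-in, not the toolkit's actual class.

```python
# Minimal sketch of a per-node OOM-kill threshold check.
# SimpleCheckResult is a hypothetical stand-in for the toolkit's CheckResult.
from dataclasses import dataclass


@dataclass
class SimpleCheckResult:
    node: str
    oom_kills: int
    passed: bool


def analyze_oom_counts(counts: dict[str, int], threshold: int = 0) -> list[SimpleCheckResult]:
    """Flag any node whose OOM kill count exceeds the configured threshold."""
    return [
        SimpleCheckResult(node=node, oom_kills=n, passed=n <= threshold)
        for node, n in counts.items()
    ]


results = analyze_oom_counts({"node-a": 0, "node-b": 3}, threshold=0)
failed = [r.node for r in results if not r.passed]
print(failed)  # ['node-b']
```

With the default `oom_kill_threshold` of 0, any non-zero OOM kill count on a node produces a failing result.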
domino_admin_toolkit.checks.test_node_oom_kills.node_oom_data(prometheus_client_v2)

Collect node OOM kill metrics from Prometheus.

Return type:

DataFrame
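A sketch of the kind of query and DataFrame shaping `node_oom_data` might perform. The metric name (`node_vmstat_oom_kill`), the one-hour window, and the client's `query` method are assumptions for illustration, not the toolkit's actual API; a fake client stands in for a live Prometheus.

```python
# Hedged sketch: collect per-node OOM kill counts from Prometheus.
# The metric name and client interface below are assumptions.
import pandas as pd

QUERY = "increase(node_vmstat_oom_kill[1h])"  # assumed node-exporter metric


def fetch_node_oom(client) -> pd.DataFrame:
    """Run the query and shape the result as one row per node."""
    # Hypothetical client contract: returns [{"node": ..., "value": ...}, ...]
    samples = client.query(QUERY)
    return pd.DataFrame(samples, columns=["node", "value"])


class FakeClient:
    """Stand-in so the sketch runs without a live Prometheus."""

    def query(self, q):
        return [{"node": "node-a", "value": 0.0}, {"node": "node-b", "value": 2.0}]


df = fetch_node_oom(FakeClient())
print(df[df["value"] > 0]["node"].tolist())  # ['node-b']
```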

domino_admin_toolkit.checks.test_node_oom_kills.test_node_oom_kills(node_oom_data)

Description: Detects kernel-level OOM kills on cluster nodes.

Failure Conditions: Any node had host-level OOM kills in the last hour.

Troubleshooting Steps:

  1. Identify which node(s) had OOM kills from the output table

  2. Check node memory usage: kubectl top node <node>

  3. Check system process memory: SSH to the node and run ps aux --sort=-rss | head

  4. Check for memory leaks in system processes (containerd, kubelet)

Resolution Steps:
  1. Cordon the affected node: kubectl cordon <node>

  2. Drain workloads: kubectl drain <node> --ignore-daemonsets

  3. Investigate root cause (containerd leak, kubelet memory, etc.)

  4. Restart the affected system process or replace the node

Required Permissions: Cluster admin, node SSH access for investigation
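The failure condition the check enforces can be sketched as an assertion over the collected DataFrame. The column names (`node`, `value`) are assumptions about the shape `node_oom_data` returns, not a confirmed schema.

```python
# Hedged sketch of the check's failure condition: fail if any node recorded
# host-level OOM kills in the window. Column names are assumptions.
import pandas as pd


def nodes_with_oom_kills(node_oom_data: pd.DataFrame) -> list[str]:
    """Return the nodes that had OOM kills; an empty list means the check passes."""
    offenders = node_oom_data[node_oom_data["value"] > 0]
    return offenders["node"].tolist()


data = pd.DataFrame({"node": ["node-a", "node-b"], "value": [0.0, 2.0]})
assert nodes_with_oom_kills(data) == ["node-b"]
```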