domino_admin_toolkit.checks.test_node_oom_kills module

pydantic model domino_admin_toolkit.checks.test_node_oom_kills.NodeOOMKillAnalyzer

Bases: AnalyzerBase

Detects kernel-level OOM kills on cluster nodes.

Fields:
field oom_kill_threshold: int = 0
analyze(data)

Evaluate a single node’s OOM kill count against the configured threshold.

Return type:

list[CheckResult]

name: ClassVar[str] = 'NodeOOMKillAnalyzer'
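The threshold comparison that `analyze` performs can be sketched as follows. This is a hedged, self-contained illustration: `AnalyzerBase` and `CheckResult` are toolkit types, so `SimpleCheckResult` below is a hypothetical stand-in, not the toolkit's actual class.

```python
# Minimal sketch of a per-node OOM-kill threshold check.
# SimpleCheckResult is a hypothetical stand-in for the toolkit's CheckResult.
from dataclasses import dataclass


@dataclass
class SimpleCheckResult:
    node: str
    oom_kills: int
    passed: bool


def analyze_oom_counts(counts: dict[str, int], threshold: int = 0) -> list[SimpleCheckResult]:
    """Flag any node whose OOM kill count exceeds the configured threshold."""
    return [
        SimpleCheckResult(node=node, oom_kills=n, passed=n <= threshold)
        for node, n in counts.items()
    ]


results = analyze_oom_counts({"node-a": 0, "node-b": 3}, threshold=0)
failed = [r.node for r in results if not r.passed]
print(failed)  # ['node-b']
```

With the default `oom_kill_threshold` of 0, any non-zero OOM kill count on a node produces a failing result.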
domino_admin_toolkit.checks.test_node_oom_kills.node_oom_data(prometheus_client_v2)

Collect node OOM kill metrics from Prometheus.

Return type:

DataFrame
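A sketch of the kind of query and DataFrame shaping `node_oom_data` might perform. The metric name (`node_vmstat_oom_kill`), the one-hour window, and the client's `query` method are assumptions for illustration, not the toolkit's actual API; a fake client stands in for a live Prometheus.

```python
# Hedged sketch: collect per-node OOM kill counts from Prometheus.
# The metric name and client interface below are assumptions.
import pandas as pd

QUERY = "increase(node_vmstat_oom_kill[1h])"  # assumed node-exporter metric


def fetch_node_oom(client) -> pd.DataFrame:
    """Run the query and shape the result as one row per node."""
    # Hypothetical client contract: returns [{"node": ..., "value": ...}, ...]
    samples = client.query(QUERY)
    return pd.DataFrame(samples, columns=["node", "value"])


class FakeClient:
    """Stand-in so the sketch runs without a live Prometheus."""

    def query(self, q):
        return [{"node": "node-a", "value": 0.0}, {"node": "node-b", "value": 2.0}]


df = fetch_node_oom(FakeClient())
print(df[df["value"] > 0]["node"].tolist())  # ['node-b']
```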

domino_admin_toolkit.checks.test_node_oom_kills.test_node_oom_kills(node_oom_data)

Description: Detects kernel-level OOM kills on cluster nodes.

Failure Conditions: Any node had host-level OOM kills in the last hour.

Troubleshooting Steps:

  1. Identify which node(s) had OOM kills from the output table

  2. Check node memory usage: kubectl top node <node>

  3. Check system process memory: SSH to the node and run ps aux --sort=-rss | head

  4. Check for memory leaks in system processes (containerd, kubelet)

Resolution Steps:
  1. Cordon the affected node: kubectl cordon <node>

  2. Drain workloads: kubectl drain <node> --ignore-daemonsets

  3. Investigate root cause (containerd leak, kubelet memory, etc.)

  4. Restart the affected system process or replace the node

Required Permissions: Cluster admin, node SSH access for investigation
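The failure condition the check enforces can be sketched as an assertion over the collected DataFrame. The column names (`node`, `value`) are assumptions about the shape `node_oom_data` returns, not a confirmed schema.

```python
# Hedged sketch of the check's failure condition: fail if any node recorded
# host-level OOM kills in the window. Column names are assumptions.
import pandas as pd


def nodes_with_oom_kills(node_oom_data: pd.DataFrame) -> list[str]:
    """Return the nodes that had OOM kills; an empty list means the check passes."""
    offenders = node_oom_data[node_oom_data["value"] > 0]
    return offenders["node"].tolist()


data = pd.DataFrame({"node": ["node-a", "node-b"], "value": [0.0, 2.0]})
assert nodes_with_oom_kills(data) == ["node-b"]
```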