domino_admin_toolkit.checks.test_node_oom_kills module
- pydantic model domino_admin_toolkit.checks.test_node_oom_kills.NodeOOMKillAnalyzer
Bases: AnalyzerBase
Detects kernel-level OOM kills on cluster nodes.
- Fields:
- analyze(data)
Evaluate a single node’s OOM kill count against the configured threshold.
- Return type:
- name: ClassVar[str] = 'NodeOOMKillAnalyzer'
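The threshold comparison that analyze(data) describes can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: NodeOOMSample, NodeOOMResult, evaluate_node, and the threshold parameter are hypothetical names, plain dataclasses stand in for the pydantic model, and the default threshold of 0 is an assumption drawn from the failure condition documented for test_node_oom_kills (any host-level OOM kill in the last hour).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOOMSample:
    """Hypothetical input shape: one node and its OOM kill count."""
    node: str
    oom_kills: int


@dataclass(frozen=True)
class NodeOOMResult:
    """Hypothetical output shape; not the toolkit's real return type."""
    node: str
    oom_kills: int
    passed: bool
    message: str


def evaluate_node(sample: NodeOOMSample, threshold: int = 0) -> NodeOOMResult:
    """Pass when the node's OOM kill count does not exceed the threshold.

    threshold=0 is an assumption: it mirrors the documented failure
    condition that any OOM kill in the window fails the check.
    """
    passed = sample.oom_kills <= threshold
    verdict = "within" if passed else "exceeds"
    message = (
        f"{sample.node}: {sample.oom_kills} OOM kill(s), "
        f"{verdict} threshold {threshold}"
    )
    return NodeOOMResult(sample.node, sample.oom_kills, passed, message)
```

Evaluating each node separately keeps the result table per-node, which matches the first troubleshooting step of identifying the affected node(s) from the output.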
- domino_admin_toolkit.checks.test_node_oom_kills.node_oom_data(prometheus_client_v2)
Collect node OOM kill metrics from Prometheus.
- Return type:
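One way such a collection step could work is sketched below. The query and the parsing helper are assumptions, since the source does not show what node_oom_data actually runs: OOM_QUERY uses the node_vmstat_oom_kill counter exposed by node_exporter's vmstat collector, parse_oom_vector is a hypothetical helper, and the input is the JSON body that the Prometheus HTTP API returns for an instant query.

```python
from typing import Any, Dict

# Assumed query; the toolkit's real PromQL expression is not documented here.
OOM_QUERY = "increase(node_vmstat_oom_kill[1h])"


def parse_oom_vector(response: Dict[str, Any], label: str = "instance") -> Dict[str, int]:
    """Turn a Prometheus instant-query JSON response into {node: kill count}.

    `label` names the series label that identifies the node; "instance"
    is an assumption and may differ per cluster.
    """
    if response.get("status") != "success":
        raise ValueError(
            f"Prometheus query failed: {response.get('error', 'unknown error')}"
        )
    counts: Dict[str, int] = {}
    for series in response["data"]["result"]:
        node = series["metric"].get(label, "unknown")
        _timestamp, raw_value = series["value"]  # sample value arrives as a string
        # increase() extrapolates, so the value may be fractional; round it.
        counts[node] = counts.get(node, 0) + round(float(raw_value))
    return counts
```

Keeping the parsing separate from the HTTP call means it can be exercised with a canned response, without a live Prometheus server.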
- domino_admin_toolkit.checks.test_node_oom_kills.test_node_oom_kills(node_oom_data)
Description: Detects kernel-level OOM kills on cluster nodes.
Failure Conditions: Any node had host-level OOM kills in the last hour.
- Troubleshooting Steps:
Identify which node(s) had OOM kills from the output table
Check node memory usage: kubectl top node <node>
Check system process memory: SSH to the node and run ps aux --sort=-rss | head
Check for memory leaks in system processes (containerd, kubelet)
- Resolution Steps:
Cordon the affected node: kubectl cordon <node>
Drain workloads: kubectl drain <node> --ignore-daemonsets
Investigate root cause (containerd leak, kubelet memory, etc.)
Restart the affected system process or replace the node
Required Permissions: Cluster admin, node SSH access for investigation
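The kubectl commands from the troubleshooting and resolution steps can be assembled per node as in this sketch. remediation_plan is a hypothetical helper, not part of the toolkit. It only builds and prints the commands, so nothing is run against a cluster.

```python
import shlex
from typing import List


def remediation_plan(node: str) -> List[List[str]]:
    """Return the documented kubectl commands for one node, in order."""
    return [
        ["kubectl", "top", "node", node],                     # check memory usage
        ["kubectl", "cordon", node],                          # stop new scheduling
        ["kubectl", "drain", node, "--ignore-daemonsets"],    # evict workloads
    ]


if __name__ == "__main__":
    # "node-a" is a placeholder node name.
    for argv in remediation_plan("node-a"):
        print(shlex.join(argv))
```

Building argument lists rather than shell strings avoids quoting problems if a command is later passed to subprocess.run; running them requires the cluster admin permissions listed above.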