domino_admin_toolkit.checks.test_k8s_platform_sizing module

class domino_admin_toolkit.checks.test_k8s_platform_sizing.TestK8sPlatformSizing

Bases: object

test_k8s_platform_pod_sizing(container_memory_df, container_cpu_df, df_key, analyzers, sort_col, column_order, top_n)
Description: Validates per-pod memory and CPU usage on platform nodes over the last hour.

Uses node-pool-scoped queries (pods on platform nodes regardless of namespace) when node pool labels are present; falls back to platform namespace scope otherwise.
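The scope fallback described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the label key `dominodatalab.com/node-pool`, the pool value `platform`, the function name `select_scope`, and the shape of the node dictionaries are all assumptions made for the example.

```python
from typing import Any, Dict, List, Mapping

# Assumed label key and value; the source only says "node pool labels".
NODE_POOL_LABEL = "dominodatalab.com/node-pool"
PLATFORM_POOL = "platform"


def select_scope(nodes: List[Mapping[str, Any]], platform_namespace: str) -> Dict[str, Any]:
    """Return a node-pool scope when labeled platform nodes exist, else a namespace scope."""
    platform_nodes = sorted(
        node["name"]
        for node in nodes
        if node.get("labels", {}).get(NODE_POOL_LABEL) == PLATFORM_POOL
    )
    if platform_nodes:
        # Pods on these nodes are in scope regardless of their namespace.
        return {"scope": "node_pool", "nodes": platform_nodes}
    # No usable labels: fall back to everything in the platform namespace.
    return {"scope": "namespace", "namespace": platform_namespace}


# Hypothetical node inventories for illustration.
labeled = [
    {"name": "node-a", "labels": {NODE_POOL_LABEL: "platform"}},
    {"name": "node-b", "labels": {NODE_POOL_LABEL: "compute"}},
]
unlabeled = [{"name": "node-c", "labels": {}}]

print(select_scope(labeled, "domino-platform"))
print(select_scope(unlabeled, "domino-platform"))
```

The first call selects the node-pool scope with `node-a`; the second falls back to the namespace scope because no node carries the assumed label.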

Failure Conditions:

memory-excess: Any pod’s avg memory usage exceeds its requests by more than 2 GiB.

memory-oom: Any pod has OOM kill events (container_oom_events_total > 0) in the last hour.

cpu-throttling: Any pod’s CPU throttle percentage exceeds 50%.

cpu-excess: Any pod’s avg CPU usage exceeds its requests by more than 2 cores.
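The four failure conditions can be expressed as simple threshold checks on per-pod metrics. The sketch below uses plain dictionaries rather than DataFrames to stay dependency-free. The field names (`mem_avg_bytes`, `cpu_throttle_pct`, and so on) and the function `failing_conditions` are hypothetical and do not reflect the actual column names used by the check. Only the thresholds come from the description above.

```python
from typing import Dict, List

GIB = 1024 ** 3
MEMORY_EXCESS_BYTES = 2 * GIB   # avg usage over requests by more than 2 GiB
CPU_EXCESS_CORES = 2.0          # avg usage over requests by more than 2 cores
CPU_THROTTLE_PCT = 50.0         # throttle percentage above 50%


def failing_conditions(pod: Dict[str, float]) -> List[str]:
    """Return the names of the failure conditions a single pod triggers."""
    failures = []
    if pod["mem_avg_bytes"] - pod["mem_request_bytes"] > MEMORY_EXCESS_BYTES:
        failures.append("memory-excess")
    if pod["oom_events"] > 0:
        failures.append("memory-oom")
    if pod["cpu_throttle_pct"] > CPU_THROTTLE_PCT:
        failures.append("cpu-throttling")
    if pod["cpu_avg_cores"] - pod["cpu_request_cores"] > CPU_EXCESS_CORES:
        failures.append("cpu-excess")
    return failures


# Hypothetical pod: 3 GiB over its memory request and throttled 62.5% of the time.
sample = {
    "mem_avg_bytes": 7 * GIB,
    "mem_request_bytes": 4 * GIB,
    "oom_events": 0,
    "cpu_throttle_pct": 62.5,
    "cpu_avg_cores": 1.5,
    "cpu_request_cores": 1.0,
}
print(failing_conditions(sample))  # ['memory-excess', 'cpu-throttling']
```

All comparisons are strict, matching the "more than" and "exceeds" wording: a pod exactly 2 GiB over its request would not be flagged in this sketch.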

Troubleshooting Steps:
  1. Check pod resource usage: kubectl top pods -n <namespace>

  2. Review container resource requests in the deployment spec

  3. Check for memory leaks or unexpected load patterns via Grafana sizing dashboard
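When comparing many pods, it can help to turn the text output of `kubectl top pods` from step 1 into structured rows. The sketch below parses the command's usual three-column table (NAME, CPU(cores), MEMORY(bytes)). The helper `parse_top_pods` is hypothetical and not part of the toolkit, and the sample output is made up for illustration.

```python
from typing import Dict, List


def parse_top_pods(output: str) -> List[Dict[str, str]]:
    """Parse `kubectl top pods` text output into one dict per pod."""
    lines = [line for line in output.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # the first non-empty line is the header
        name, cpu, memory = line.split()[:3]
        rows.append({"pod": name, "cpu": cpu, "memory": memory})
    return rows


# Hypothetical output; real pod names and values will differ.
SAMPLE = """\
NAME            CPU(cores)   MEMORY(bytes)
example-pod-a   250m         3100Mi
example-pod-b   12m          180Mi
"""

for row in parse_top_pods(SAMPLE):
    print(row)
```

The values stay as the strings kubectl prints (for example `250m` and `3100Mi`). Converting them to cores and bytes is left out of this sketch.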

Resolution Steps:
  1. Adjust container memory/CPU requests and limits in the Helm values

  2. For OOM kills: increase memory limits or investigate a possible memory leak

  3. For CPU throttling: increase CPU limits or reduce workload

Required Permissions: Platform admin access

domino_admin_toolkit.checks.test_k8s_platform_sizing.container_cpu_df(prometheus_client_v2, platform_namespace)

Per-pod CPU metrics for platform nodes, collected once per session.

Return type:

DataFrame

domino_admin_toolkit.checks.test_k8s_platform_sizing.container_memory_df(prometheus_client_v2, platform_namespace)

Per-pod memory metrics for platform nodes, collected once per session.

Return type:

DataFrame

domino_admin_toolkit.checks.test_k8s_platform_sizing.format_cpu_columns(df)

Return type:

DataFrame

domino_admin_toolkit.checks.test_k8s_platform_sizing.format_memory_columns(df)

Return type:

DataFrame