domino_admin_toolkit.checks.test_dmm_pods_list module
DMM Pods Health Check
Data source: Kubernetes API (namespace-scoped pod listings in the platform and compute namespaces).
Question answered: “Are the pods DMM ingestion depends on present and Ready?”
- What this check does NOT cover:
Job-level stuck conditions — see test_dmm_ingestion_jobs_status.
Redis queue depth / stale running locks — see test_dmm_redis_queues.
Spark executor activity — see info/test_dmm_spark.
This check replaces an earlier info-only implementation that called
list_pods(namespace=None) (cluster-wide) and then applied a substring filter
against generic image-name tokens. On a 108-node customer cluster that path took
~30 minutes wall-clock and surfaced pods unrelated to DMM (Bitnami’s
shared Postgres / Prometheus charts matched too). The targeted approach
here uses two namespace-scoped lists plus prefix matching, which completes in
well under a minute even on large clusters.
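The following is a minimal sketch of the namespace-scoped listing plus prefix matching described above, not the toolkit's actual code. The function name `find_expected_pods`, the `list_pod_names` callable, the row keys, the namespace names, and the assignment of prefixes to namespaces are all assumptions made for illustration; a real implementation would presumably back the callable with one Kubernetes API list call per namespace.

```python
from typing import Any, Callable, Dict, Iterable, List


def find_expected_pods(
    list_pod_names: Callable[[str], Iterable[str]],
    expected: Dict[str, List[str]],
) -> List[Dict[str, Any]]:
    """Return one row per expected pod prefix, using one listing per namespace."""
    rows: List[Dict[str, Any]] = []
    for namespace, prefixes in expected.items():
        # A single namespace-scoped listing, instead of a cluster-wide one.
        names = list(list_pod_names(namespace))
        for prefix in prefixes:
            # Prefix match on the pod name, not a substring match on image tokens.
            matches = [name for name in names if name.startswith(prefix)]
            rows.append(
                {
                    "namespace": namespace,
                    "expected": prefix,
                    "found": bool(matches),
                    "pods": matches,
                }
            )
    return rows


# Hypothetical in-memory stand-in for the cluster (all names are made up).
fake_cluster = {
    "domino-platform": ["dmm-plier-7c9f-abcde", "dmm-redis-ha-server-0", "postgresql-0"],
    "domino-compute": ["spark3-master-0", "prometheus-server-1"],
}
expected = {
    "domino-platform": ["dmm-plier", "dmm-redis-ha"],
    "domino-compute": ["spark3-master", "spark3-worker"],
}
rows = find_expected_pods(lambda ns: fake_cluster[ns], expected)
```

In this sketch the unrelated Postgres and Prometheus pods never match because only names that start with an expected prefix are kept, and the absent `spark3-worker` appears as a row with `found` set to `False`.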
- pydantic model domino_admin_toolkit.checks.test_dmm_pods_list.DmmPodsAnalyzer
Bases: DataFrameAnalyzerBase
Verifies the expected DMM pod set is present and ready.
Health signal is the K8s Ready condition (not pod phase). A pod’s phase is Running whenever ≥1 container is up, so a multi-container pod with a CrashLoopBackOff sidecar would still report phase=Running even though it’s effectively broken (1/2 CrashLoopBackOff). The Ready condition aggregates per-container readiness and catches that.
- DMM pods missing or not ready → FAIL.
- Namespace-level list_pods failure → ERROR (one per namespace): we couldn’t determine pod state, so the underlying DMM state is unknown rather than known-bad. This avoids cascading 5+ misleading FAIL results from a single toolkit-side API hiccup.
- Everything green → PASS.
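As an illustration of why the Ready condition is a better health signal than the phase, here is a small sketch that works on a pod represented as a plain dict shaped like the Kubernetes API's JSON. The helper name `is_pod_ready` and the sample pods are hypothetical; the analyzer's real implementation may differ.

```python
from typing import Any, Dict


def is_pod_ready(pod: Dict[str, Any]) -> bool:
    """True only if the pod carries a condition of type Ready with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in conditions
    )


# Multi-container pod whose sidecar is crash-looping: phase is still Running,
# but the Ready condition is False.
crashing_sidecar_pod = {
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": "Ready", "status": "False"},
        ],
    }
}

healthy_pod = {
    "status": {
        "phase": "Running",
        "conditions": [{"type": "Ready", "status": "True"}],
    }
}
```

A phase-only check would treat both sample pods as healthy, while `is_pod_ready` separates them. A pod with no conditions at all is treated as not ready in this sketch.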
- Fields:
- analyze(data)
Analyzes the full DataFrame and returns a list of CheckResult instances.
- Return type:
- Args:
data: The full DataFrame. Called once per check_df() invocation.
- Returns:
List[CheckResult]: A list containing the results of the analysis.
- Raises:
NotImplementedError: If this method is not implemented by subclasses.
- name: ClassVar[str] = 'DmmPodsAnalyzer'
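The outcome rules described for DmmPodsAnalyzer (FAIL for missing or not-ready pods, a single ERROR per namespace whose listing failed, PASS otherwise) can be sketched roughly as below. This is an assumption-laden illustration, not the analyzer's code: it uses plain dict rows instead of a DataFrame, `(status, message)` tuples instead of CheckResult, and a hypothetical `list_error` key alongside the `found` and `ready` columns; emitting one FAIL per affected pod is also an assumption.

```python
from typing import Any, Dict, List, Tuple

Result = Tuple[str, str]  # (status, message); a stand-in for CheckResult


def classify(rows: List[Dict[str, Any]]) -> List[Result]:
    results: List[Result] = []
    errored_namespaces: List[str] = []
    for row in rows:
        namespace = row["namespace"]
        if row.get("list_error"):
            # Pod state is unknown here, so report ERROR once per namespace
            # rather than a FAIL for every expected pod in that namespace.
            if namespace not in errored_namespaces:
                errored_namespaces.append(namespace)
                results.append(
                    ("ERROR", f"could not list pods in {namespace}: {row['list_error']}")
                )
            continue
        if not row["found"]:
            results.append(("FAIL", f"{row['expected']} is missing in {namespace}"))
        elif not row["ready"]:
            results.append(("FAIL", f"{row['expected']} is not Ready in {namespace}"))
    return results or [("PASS", "all expected DMM pods are present and Ready")]


# Hypothetical rows: the listing failed for ns-b, and one pod in ns-a is not Ready.
example_rows = [
    {"namespace": "ns-a", "expected": "dmm-plier", "found": True, "ready": False},
    {"namespace": "ns-a", "expected": "dmm-compute", "found": True, "ready": True},
    {"namespace": "ns-b", "expected": "spark3-master", "list_error": "timeout"},
    {"namespace": "ns-b", "expected": "spark3-worker", "list_error": "timeout"},
]
example_results = classify(example_rows)
```

With these sample rows the two ns-b entries collapse into one ERROR, which is the behavior the class description motivates: a single API failure should not be reported as several broken pods.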
- domino_admin_toolkit.checks.test_dmm_pods_list.dmm_pods_data(k8s_client, platform_namespace, compute_namespace)
Per-expected-pod DataFrame consumed by DmmPodsAnalyzer.
- Return type:
- domino_admin_toolkit.checks.test_dmm_pods_list.test_dmm_pods_list(dmm_pods_data, runner)
- Description:
Verifies the DMM pod set (dmm-compute, dmm-plier, dmm-redis-ha, spark3-master, spark3-worker) is present and Ready. Observability and auth pods (prometheus, keycloak) are intentionally not covered here — see test_prometheus / test_keycloak for those.
- Result:
PASS: All expected DMM pods are present and Ready.
FAIL: Any DMM pod (dmm-compute, dmm-plier, dmm-redis-ha-server-0, spark3-master, spark3-worker) is missing or not Ready.
- Troubleshooting Steps:
Check the detail table for any rows with found = False or ready = False.
- For missing pods:
kubectl get pods -n <namespace> | grep <expected_name>
kubectl describe deployment/<expected_name> -n <namespace>
- For unhealthy pods:
kubectl describe pod -n <namespace> <pod_name>
kubectl logs -n <namespace> <pod_name> --tail=200
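The triage steps above can be turned into a small helper that prints the relevant kubectl commands for each problem row of the detail table. This is only a sketch: the function name `triage_commands` and the row keys `namespace`, `expected`, and `pod_name` are hypothetical, only the `found` and `ready` columns are taken from the check's description, and the `grep` filter is left out for simplicity.

```python
import shlex
from typing import Any, Dict, List


def triage_commands(rows: List[Dict[str, Any]]) -> List[str]:
    """Build the kubectl commands to run for each missing or not-ready row."""
    commands: List[str] = []
    for row in rows:
        namespace = shlex.quote(row["namespace"])
        if not row["found"]:
            # Missing pod: list the namespace and inspect the owning workload.
            expected = shlex.quote(row["expected"])
            commands.append(f"kubectl get pods -n {namespace}")
            commands.append(f"kubectl describe deployment/{expected} -n {namespace}")
        elif not row["ready"]:
            # Pod exists but is not Ready: inspect events and recent logs.
            pod = shlex.quote(row["pod_name"])
            commands.append(f"kubectl describe pod -n {namespace} {pod}")
            commands.append(f"kubectl logs -n {namespace} {pod} --tail=200")
    return commands


# Hypothetical detail rows.
detail_rows = [
    {"namespace": "ns-a", "expected": "dmm-compute", "found": False, "ready": False},
    {"namespace": "ns-a", "expected": "dmm-plier", "found": True, "ready": False,
     "pod_name": "dmm-plier-abc12"},
    {"namespace": "ns-b", "expected": "spark3-master", "found": True, "ready": True,
     "pod_name": "spark3-master-0"},
]
commands = triage_commands(detail_rows)
```

The helper only builds strings and does not execute anything, so the output can be reviewed before it is pasted into a shell.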
- Resolution Steps:
- Roll the affected workload:
kubectl rollout restart deployment/<name> -n <namespace>
If a node-level constraint blocks scheduling (taints, resource pressure), drain or scale the node pool to unblock.
- Required Permissions:
Platform admin (kubectl read on platform and compute namespaces).