domino_admin_toolkit.checks.test_dmm_pods_list module

DMM Pods Health Check

Data source: Kubernetes API (namespace-scoped pod listings in the platform and compute namespaces).

Question answered: “Are the pods DMM ingestion depends on present and Ready?”

What this check does NOT cover:

Job-level stuck conditions — see test_dmm_ingestion_jobs_status.
Redis queue depth / stale running locks — see test_dmm_redis_queues.
Spark executor activity — see info/test_dmm_spark.

This check replaces an earlier info-only implementation that called list_pods(namespace=None) (cluster-wide) and then applied a substring filter against generic image-name tokens. On a 108-node customer cluster that path took ~30 minutes wall-clock and surfaced pods unrelated to DMM (Bitnami’s shared Postgres / Prometheus charts matched too). The targeted approach here uses two namespace-scoped lists plus prefix matching, which completes in well under a minute even on large clusters.
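The difference between prefix matching and a substring filter can be sketched as follows. This is an illustration only, not the toolkit's code: the function names, the sample pod names, and the substring tokens are made up, and the filter is simplified to run on pod names rather than image names. Only the five prefixes come from the pod set this check names.

```python
# Prefixes taken from the pod set this check names; everything else
# in this sketch is hypothetical.
EXPECTED_PREFIXES = (
    "dmm-compute",
    "dmm-plier",
    "dmm-redis-ha",
    "spark3-master",
    "spark3-worker",
)


def match_by_prefix(pod_names, prefixes=EXPECTED_PREFIXES):
    """Return {prefix: [pod names that start with that prefix]}."""
    return {p: [n for n in pod_names if n.startswith(p)] for p in prefixes}


def match_by_substring(pod_names, tokens):
    """Older-style filter: keep any name that contains any token."""
    return [n for n in pod_names if any(t in n for t in tokens)]


# Made-up pod names standing in for one namespace listing.
names = [
    "dmm-compute-6d5f7c9b8-x2k4p",
    "dmm-redis-ha-server-0",
    "shared-redis-master-0",
    "prometheus-server-0",
]

by_prefix = match_by_prefix(names)
by_substring = match_by_substring(names, tokens=("redis", "prometheus"))
```

With these sample names, the prefix match attributes only dmm-redis-ha-server-0 to dmm-redis-ha, while the substring filter also picks up the unrelated shared-redis-master-0 and prometheus-server-0 pods. An empty list for a prefix (dmm-plier here) is what a "missing" pod would look like.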

pydantic model domino_admin_toolkit.checks.test_dmm_pods_list.DmmPodsAnalyzer

Bases: DataFrameAnalyzerBase

Verifies the expected DMM pod set is present and ready.

Health signal is the K8s Ready condition (not pod phase). A pod’s phase is Running whenever ≥1 container is up, so a multi-container pod with a CrashLoopBackOff sidecar would still report phase=Running even though it’s effectively broken (1/2 CrashLoopBackOff). The Ready condition aggregates per-container readiness and catches that.
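A minimal sketch of the phase-versus-Ready distinction, using a plain dict shaped like the status block of a Kubernetes pod object. The helper name is_ready and the sample pod are hypothetical; the analyzer's actual implementation may differ.

```python
def is_ready(pod: dict) -> bool:
    """True only if the pod has a Ready condition whose status is "True"."""
    conditions = (pod.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


# Hypothetical two-container pod whose sidecar is crash-looping:
# the phase is still Running, but the Ready condition is False.
crashing_sidecar_pod = {
    "metadata": {"name": "example-pod"},
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "Initialized", "status": "True"},
            {"type": "ContainersReady", "status": "False"},
            {"type": "Ready", "status": "False"},
        ],
    },
}

phase = crashing_sidecar_pod["status"]["phase"]
ready = is_ready(crashing_sidecar_pod)
```

Here phase is "Running" while ready is False, which is the case a phase-based check would miss.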

DMM pods missing or not ready → FAIL.
Namespace-level list_pods failure → ERROR (one per namespace). The check could not determine pod state, so the underlying DMM state is unknown rather than known-bad. This avoids cascading 5+ misleading FAIL results from a single toolkit-side API hiccup.
Everything green → PASS.
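The outcome rules above could be expressed roughly as follows. This is a sketch under stated assumptions, not the analyzer's code: the Outcome enum, the classify function, the row keys other than found and ready, and the namespace names are all hypothetical, as is the assignment of pods to namespaces in the sample rows.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    ERROR = "ERROR"


def classify(rows, failed_namespaces):
    """Return a list of (Outcome, message) pairs.

    rows: dicts with keys expected_name, namespace, found, ready.
    failed_namespaces: namespaces whose pod listing raised an error.
    """
    results = []
    # One ERROR per namespace that could not be listed.
    for ns in sorted(failed_namespaces):
        results.append((Outcome.ERROR, f"could not list pods in {ns}"))
    for row in rows:
        if row["namespace"] in failed_namespaces:
            continue  # state unknown; already reported as ERROR above
        if not row["found"]:
            results.append((Outcome.FAIL, f"{row['expected_name']} is missing"))
        elif not row["ready"]:
            results.append((Outcome.FAIL, f"{row['expected_name']} is not Ready"))
    if not results:
        results.append((Outcome.PASS, "all expected DMM pods are present and Ready"))
    return results


# Hypothetical rows: the listing for "platform-ns" failed, so its pods
# look missing, but their real state is unknown.
rows = [
    {"expected_name": "dmm-compute", "namespace": "platform-ns", "found": False, "ready": False},
    {"expected_name": "dmm-plier", "namespace": "platform-ns", "found": False, "ready": False},
    {"expected_name": "spark3-worker", "namespace": "compute-ns", "found": True, "ready": True},
]

outcomes = classify(rows, failed_namespaces={"platform-ns"})
```

With the failed namespace passed in, the two platform-ns rows collapse into a single ERROR instead of two FAIL results. Calling classify(rows, set()) on the same rows would report both pods as missing.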

analyze(data)

Analyzes the full DataFrame and returns a list of CheckResult instances.

Return type: list[CheckResult]

Args: data: The full DataFrame. Called once per check_df() invocation.
Returns: List[CheckResult]: A list containing the results of the analysis.
Raises: NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'DmmPodsAnalyzer'

domino_admin_toolkit.checks.test_dmm_pods_list.dmm_pods_data(k8s_client, platform_namespace, compute_namespace)

Per-expected-pod DataFrame consumed by DmmPodsAnalyzer.

Return type: DataFrame

domino_admin_toolkit.checks.test_dmm_pods_list.test_dmm_pods_list(dmm_pods_data, runner)

Description:

Verifies the DMM pod set (dmm-compute, dmm-plier, dmm-redis-ha, spark3-master, spark3-worker) is present and Ready. Observability and auth pods (prometheus, keycloak) are intentionally not covered here — see test_prometheus / test_keycloak for those.

Result:

PASS: All expected DMM pods are present and Ready.
FAIL: Any DMM pod (dmm-compute, dmm-plier, dmm-redis-ha-server-0, spark3-master, spark3-worker) is missing or not Ready.

Troubleshooting Steps:

Check the detail table for any rows with found = False or ready = False.
For missing pods:
kubectl get pods -n <namespace> | grep <expected_name>
kubectl describe deployment/<expected_name> -n <namespace>
For unhealthy pods:
kubectl describe pod -n <namespace> <pod_name>
kubectl logs -n <namespace> <pod_name> --tail=200

Resolution Steps:

Roll the affected workload:
kubectl rollout restart deployment/<name> -n <namespace>
If a node-level constraint blocks scheduling (taints, resource pressure), drain or scale the node pool to unblock.

Required Permissions:

Platform admin (kubectl read on platform and compute namespaces).