domino_admin_toolkit.checks.test_dmm_pods_list module

DMM Pods Health Check

Data source: Kubernetes API (namespace-scoped pod listings in the platform and compute namespaces).

Question answered: “Are the pods DMM ingestion depends on present and Ready?”

What this check does NOT cover:
  • Job-level stuck conditions — see test_dmm_ingestion_jobs_status.

  • Redis queue depth / stale running locks — see test_dmm_redis_queues.

  • Spark executor activity — see info/test_dmm_spark.

This check replaces an earlier info-only implementation that called list_pods(namespace=None) (a cluster-wide listing) and then applied a substring filter against generic image-name tokens. On a 108-node customer cluster that path took ~30 minutes wall-clock and surfaced pods unrelated to DMM (Bitnami’s shared Postgres / Prometheus charts matched too). The targeted approach here is two namespace-scoped lists plus prefix matching, which completes in well under a minute even on large clusters.
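The prefix-matching idea can be illustrated with a minimal sketch. This is not the toolkit's implementation: the function name, the row keys, the placeholder namespace names, and the assignment of pods to namespaces are all assumptions made for illustration. Only the expected pod prefixes come from this page.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping of namespace -> expected pod-name prefixes.
# Which namespace each workload lives in is an assumption, not taken from the toolkit.
EXPECTED_PREFIXES: Dict[str, List[str]] = {
    "platform": ["dmm-plier", "dmm-redis-ha"],
    "compute": ["dmm-compute", "spark3-master", "spark3-worker"],
}


def match_expected_pods(
    pod_names_by_namespace: Dict[str, Iterable[str]],
    expected: Dict[str, List[str]],
) -> List[dict]:
    """Return one row per expected prefix, noting which listed pods match it."""
    rows = []
    for namespace, prefixes in expected.items():
        names = list(pod_names_by_namespace.get(namespace, []))
        for prefix in prefixes:
            # Prefix match, not substring match: "postgresql-0" can never
            # match "dmm-..." or "spark3-..." by accident.
            matches = [name for name in names if name.startswith(prefix)]
            rows.append(
                {
                    "namespace": namespace,
                    "expected_name": prefix,
                    "found": bool(matches),
                    "pods": matches,
                }
            )
    return rows


# Example input: the result of two namespace-scoped listings (names only).
listing = {
    "platform": ["dmm-plier-abc12", "postgresql-0", "dmm-redis-ha-server-0"],
    "compute": ["spark3-master-0"],
}
rows = match_expected_pods(listing, EXPECTED_PREFIXES)
```

Because each expected prefix is looked up only in its own namespace listing, unrelated pods such as a shared Postgres chart never appear in the output, and the work scales with the size of two namespaces rather than the whole cluster.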

pydantic model domino_admin_toolkit.checks.test_dmm_pods_list.DmmPodsAnalyzer

Bases: DataFrameAnalyzerBase

Verifies the expected DMM pod set is present and ready.

The health signal is the K8s Ready condition, not the pod phase. A pod’s phase is Running whenever at least one container is running, starting, or restarting, so a multi-container pod with a CrashLoopBackOff sidecar would still report phase=Running even though it’s effectively broken (kubectl shows it as 1/2 CrashLoopBackOff). The Ready condition aggregates per-container readiness and catches that.
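The difference between phase and the Ready condition can be shown with a short sketch that operates on a pod in the JSON shape returned by kubectl get pod -o json. The helper name is_pod_ready and the sample pod are hypothetical; the analyzer's own code may read the condition differently.

```python
from typing import Any, Dict


def is_pod_ready(pod: Dict[str, Any]) -> bool:
    """True only if the pod carries a Ready condition whose status is "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in conditions
    )


# Hypothetical two-container pod whose sidecar is crash-looping.
crashing_sidecar_pod = {
    "status": {
        "phase": "Running",  # phase alone looks healthy
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": "Initialized", "status": "True"},
            {"type": "ContainersReady", "status": "False"},
            {"type": "Ready", "status": "False"},
        ],
    }
}

phase_looks_ok = crashing_sidecar_pod["status"]["phase"] == "Running"
ready = is_pod_ready(crashing_sidecar_pod)
```

Here phase_looks_ok is True while ready is False, which is the case a phase-based check would miss.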

  • DMM pods missing or not ready → FAIL.

  • Namespace-level list_pods failure → ERROR (one per namespace). We couldn’t determine pod state, so the underlying DMM state is unknown rather than known-bad. Reporting it as a single ERROR avoids cascading 5+ misleading FAIL results from one toolkit-side API hiccup.

  • Everything green → PASS.
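The three outcomes above can be sketched as a small classification function. This is an illustrative assumption, not the analyzer's code: it returns plain (status, message) tuples instead of CheckResult instances, and the row keys (namespace, expected_name, found, ready) and the list_errors argument are hypothetical names.

```python
from typing import Dict, List, Tuple


def summarize(rows: List[dict], list_errors: Dict[str, str]) -> List[Tuple[str, str]]:
    """Map per-pod rows and per-namespace listing errors to PASS/FAIL/ERROR."""
    results: List[Tuple[str, str]] = []

    # One ERROR per namespace whose listing failed: pod state is unknown there.
    for namespace, error in sorted(list_errors.items()):
        results.append(("ERROR", f"list_pods failed in {namespace}: {error}"))

    for row in rows:
        # Skip rows from a namespace we could not list, so a single API
        # failure does not fan out into one FAIL per expected pod.
        if row["namespace"] in list_errors:
            continue
        if not row["found"]:
            results.append(("FAIL", f"{row['expected_name']} missing in {row['namespace']}"))
        elif not row["ready"]:
            results.append(("FAIL", f"{row['expected_name']} not Ready in {row['namespace']}"))

    if not results:
        results.append(("PASS", "All expected DMM pods are present and Ready"))
    return results


example_rows = [
    {"namespace": "platform", "expected_name": "dmm-plier", "found": True, "ready": True},
    {"namespace": "compute", "expected_name": "dmm-compute", "found": False, "ready": False},
]
normal_run = summarize(example_rows, {})
api_failure_run = summarize(example_rows, {"compute": "timeout"})
```

In normal_run the missing dmm-compute pod yields a FAIL; in api_failure_run the same row is suppressed and replaced by a single ERROR for the compute namespace.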

analyze(data)

Analyzes the full DataFrame and returns a list of CheckResult instances.

Return type:

list[CheckResult]

Args:

data: The full DataFrame. Called once per check_df() invocation.

Returns:

List[CheckResult]: A list containing the results of the analysis.

Raises:

NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'DmmPodsAnalyzer'

domino_admin_toolkit.checks.test_dmm_pods_list.dmm_pods_data(k8s_client, platform_namespace, compute_namespace)

Per-expected-pod DataFrame consumed by DmmPodsAnalyzer.

Return type:

DataFrame

domino_admin_toolkit.checks.test_dmm_pods_list.test_dmm_pods_list(dmm_pods_data, runner)

Description:

Verifies the DMM pod set (dmm-compute, dmm-plier, dmm-redis-ha, spark3-master, spark3-worker) is present and Ready. Observability and auth pods (prometheus, keycloak) are intentionally not covered here — see test_prometheus / test_keycloak for those.

Result:

PASS: All expected DMM pods are present and Ready.

FAIL: Any DMM pod (dmm-compute, dmm-plier, dmm-redis-ha-server-0, spark3-master, spark3-worker) is missing or not Ready.

Troubleshooting Steps:
  1. Check the detail table for any rows with found = False or ready = False.

  2. For missing pods:

    kubectl get pods -n <namespace> | grep <expected_name>
    kubectl describe deployment/<expected_name> -n <namespace>

  3. For unhealthy pods:

    kubectl describe pod -n <namespace> <pod_name>
    kubectl logs -n <namespace> <pod_name> --tail=200

Resolution Steps:
  1. Roll the affected workload:

    kubectl rollout restart deployment/<name> -n <namespace>

  2. If a node-level constraint blocks scheduling (taints, resource pressure), drain or scale the node pool to unblock.

Required Permissions:

Platform admin (kubectl read on platform and compute namespaces).