domino_admin_toolkit.checks.test_dmm_ingestion_jobs_status module

DMM Ingestion Jobs Status Health Check

Detects DMM (Domino Model Monitor) ingestion jobs stuck in pending or processing states by querying the model-monitor.dataset_jobs MongoDB collection and cross-referencing the dmm-redis-ha pending_jobs / running_jobs lists.

dmm-compute is single-threaded: a single job stuck in processing (Plier 5xx retry, missing datasource secret, OOM, corrupted record, infrastructure issue) blocks every subsequent job. The Prometheus metric dmm_plier_action_latency_seconds is recorded only on status transitions, so a stuck job produces no signal and is currently noticed only when customers complain. This check surfaces stuck jobs automatically and maps each branch to one of the five root causes documented in the “DMM Common Issues” runbook.

Scope: this check focuses on per-job stuck conditions. Cluster-wide Redis queue-depth and stale-running-lock signals live in test_dmm_redis_pending_jobs and test_dmm_redis_running_jobs (uplifted to enforce thresholds per RE-3126).

Relationship to existing DMM checks:
  • This check: MongoDB + Redis cross-ref — “Are individual jobs stuck and why?”

  • test_dmm_redis_pending_jobs / running_jobs: cluster-wide queue depth thresholds — “Is the queue growing or are there stale locks?”

  • test_dmm_spark: Spark worker UI — “What executors is Spark running?”

  • test_dmm_pods_list: K8s API — “Are DMM pods up?”

pydantic model domino_admin_toolkit.checks.test_dmm_ingestion_jobs_status.DmmStuckJobsAnalyzer

Bases: DataFrameAnalyzerBase

Analyzes the per-job DataFrame of stuck DMM ingestion jobs.

Operates directly on rows (one per stuck job), computing aggregates inline. Each branch maps to a documented root cause from the DMM Common Issues runbook; multiple branches can fire in a single run (e.g., stuck processing + queue-name mismatch).

Cluster-wide Redis queue-depth and stale-running-lock signals are out of scope for this analyzer — they belong on test_dmm_redis_pending_jobs and test_dmm_redis_running_jobs (uplifted to enforce thresholds per RE-3126).

Fields:
field pending_fail_minutes: float = 240

Age threshold (minutes) for pending jobs to trigger FAIL

field pending_warn_minutes: float = 60

Age threshold (minutes) for pending jobs to trigger WARN

field processing_fail_minutes: float = 30

Age threshold (minutes) for processing jobs to trigger FAIL

analyze(data)

Analyzes the full DataFrame and returns a list of CheckResult instances.

Return type:

list[CheckResult]

Args:

data: The full DataFrame. Called once per check_df() invocation.

Returns:

List[CheckResult]: A list containing the results of the analysis.

Raises:

NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'DmmStuckJobsAnalyzer'
domino_admin_toolkit.checks.test_dmm_ingestion_jobs_status.dmm_ingestion_data(_dmm_mongo_client, k8s_client)

Per-job DataFrame of stuck DMM ingestion jobs.

Skip semantics:
  • Redis unreachable → skip (we lose per-row cross-reference signal).

  • Mongo query failure → skip (no per-job data to analyze; cluster-wide Redis depth signals live in test_dmm_redis_pending_jobs / running_jobs per RE-3126).

domino_admin_toolkit.checks.test_dmm_ingestion_jobs_status.test_dmm_ingestion_jobs_status(dmm_ingestion_data)
Description:

Detects DMM ingestion jobs stuck in pending or processing states by querying the model-monitor.dataset_jobs MongoDB collection and cross-referencing the dmm-redis-ha pending_jobs and running_jobs lists. Each FAIL/WARN branch maps to a documented root cause from the DMM Common Issues runbook.

Use the _id column to grep dmm-compute logs for the cause:

kubectl logs deployment/dmm-compute -n <platform_namespace> | grep <_id>

See also:
  • test_dmm_redis_pending_jobs / test_dmm_redis_running_jobs — cluster-wide queue depth thresholds and stale-running-lock detection. Run these to determine whether the queue itself is sized/draining correctly, independent of any specific job.

  • test_dmm_spark (info) — Spark worker UI; check whether executors are starved or idle when jobs are stuck in processing.

  • test_dmm_pods_list — verify dmm-compute / dmm-plier / dmm-redis-ha pods are all up before diving into job state.

  • DMM Common Issues runbook (Confluence, under DMM Runbooks): full list of root causes, log signatures, and the Post-Migration Checklist.

Result:

PASS: No DMM ingestion jobs stuck, or jobs in transition are within thresholds.

WARN: Pending age between 1h and 4h.

FAIL: Processing job > 30 min (single-threaded compute wedged), pending job > 4h, Mongo pending job not present in Redis (queue-name mismatch), or stuck job’s datasource secret is missing.

SKIP: MongoDB or DMM Redis is unavailable, or DMM is not deployed.

Thresholds:
  • Processing FAIL: 30 minutes

  • Pending WARN: 60 minutes

  • Pending FAIL: 240 minutes (4 hours)
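The thresholds above can be sketched as a small classification function. This is an illustrative sketch only, not the analyzer's actual code: the function names, the strict ">" comparison at each boundary, and the use of a single job timestamp to compute age are assumptions. Only the three default values come from the documented fields.

```python
from datetime import datetime, timezone

# Defaults mirror the documented analyzer fields (all in minutes).
PROCESSING_FAIL_MINUTES = 30
PENDING_WARN_MINUTES = 60
PENDING_FAIL_MINUTES = 240


def age_minutes(since, now=None):
    """Minutes elapsed since a timezone-aware timestamp (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return (now - since).total_seconds() / 60


def classify_job(status, age,
                 processing_fail=PROCESSING_FAIL_MINUTES,
                 pending_warn=PENDING_WARN_MINUTES,
                 pending_fail=PENDING_FAIL_MINUTES):
    """Map a job's status and age in minutes to PASS, WARN, or FAIL.

    Assumption: boundaries are exclusive (strictly greater than).
    """
    if status == "processing":
        # There is no WARN tier for processing jobs in the documented thresholds.
        return "FAIL" if age > processing_fail else "PASS"
    if status == "pending":
        if age > pending_fail:
            return "FAIL"
        if age > pending_warn:
            return "WARN"
    return "PASS"
```

For example, under these assumptions a job pending for 90 minutes classifies as WARN, while a job processing for 31 minutes classifies as FAIL.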

Failure Conditions:
  • A job has been processing for > 30 min (dmm-compute is single-threaded; this blocks all subsequent jobs).

  • A job has been pending for > 4 h (queue not draining).

  • One or more MongoDB pending jobs are not in the Redis pending_jobs list (Helm REDIS_PENDING_QUEUE / PENDING_JOBS_QUEUE misaligned across pods).

  • A stuck job’s datasource_id has no corresponding dmm-datasource-<id> K8s secret (post-migration secret restoration missed).
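The last two failure conditions are set-membership checks, which the following sketch illustrates on plain Python collections. It is not the check's implementation: the helper names are hypothetical, and it assumes each Redis pending_jobs entry is a plain job ID string (the real payload format may differ). The dmm-datasource-<id> naming pattern is the one documented above.

```python
def find_jobs_missing_from_redis(mongo_pending_ids, redis_pending_entries):
    """Return Mongo pending job IDs with no matching entry in the Redis list.

    Assumption: each Redis list entry is the job ID as a string.
    """
    queued = {str(entry) for entry in redis_pending_entries}
    return [job_id for job_id in mongo_pending_ids if str(job_id) not in queued]


def expected_secret_name(datasource_id):
    """Build the documented secret name for a datasource."""
    return f"dmm-datasource-{datasource_id}"


def find_missing_secrets(datasource_ids, existing_secret_names):
    """Return datasource IDs whose expected K8s secret is absent."""
    existing = set(existing_secret_names)
    return [
        ds_id for ds_id in datasource_ids
        if expected_secret_name(ds_id) not in existing
    ]
```

A non-empty result from the first helper corresponds to the “In Redis” = False rows in the detail table; a non-empty result from the second corresponds to “Secret OK” = False.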

Troubleshooting Steps:
  1. Identify stuck jobs in the detail table (note _id, status, job_type, datasource_id, In Redis, Secret OK).

  2. Use _id to grep dmm-compute logs for the specific failure signature:

    kubectl logs deployment/dmm-compute -n <platform_namespace> | grep <_id>

  3. If stuck in processing, run test_dmm_spark to check whether Spark executors are starved. Run test_dmm_redis_running_jobs to surface stale locks (its analyzer flags depth > 1).

  4. If “In Redis” is False for pending jobs, this is a queue-name mismatch:

    kubectl exec -n <platform_namespace> deploy/dmm-plier -- env | grep -iE 'queue|redis'
    kubectl exec -n <platform_namespace> deploy/dmm-compute -- env | grep -iE 'queue|redis'

  5. If “Secret OK” is False, restore the missing dmm-datasource-<id> secret from the source environment (Post-Migration Checklist).

  6. Cross-check Plier health for 5xx retry loops:

    kubectl logs deployment/dmm-plier -n <platform_namespace> --tail=200 | grep -i error
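When several jobs are stuck, the log searches in the steps above can be scripted. The sketch below is one possible approach, not part of the toolkit: all function names are hypothetical, and it assumes kubectl is on the PATH with access to the platform namespace.

```python
import subprocess


def build_logs_command(namespace, deployment="dmm-compute", tail=None):
    """Build the kubectl argv for fetching a deployment's logs."""
    cmd = ["kubectl", "logs", f"deployment/{deployment}", "-n", namespace]
    if tail is not None:
        cmd.append(f"--tail={tail}")
    return cmd


def lines_mentioning(job_id, log_text):
    """Return the log lines that contain the job ID (equivalent to grep <_id>)."""
    return [line for line in log_text.splitlines() if job_id in line]


def grep_job_logs(job_id, namespace):
    """Fetch dmm-compute logs and keep only the lines mentioning one job."""
    result = subprocess.run(
        build_logs_command(namespace),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits non-zero
    )
    return lines_mentioning(job_id, result.stdout)
```

Keeping command construction and line filtering separate from the subprocess call means both can be exercised without a cluster.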

Resolution Steps:
  1. For OOM in dmm-compute:

    kubectl describe pod -n <platform_namespace> -l app=dmm-compute  # confirm OOMKilled

    Increase memory limits via Helm and roll out:

    kubectl rollout restart deployment/dmm-compute -n <platform_namespace>

  2. For Plier 5xx retry loop or corrupted record (job unrecoverable):

    db.dataset_jobs.updateOne({_id: ObjectId('<id>')}, {$set: {status: 'failed'}})
    kubectl rollout restart deployment/dmm-compute -n <platform_namespace>

  3. For queue-name mismatch:

    Align Helm values across dmm-plier and dmm-compute, then re-deploy both.

  4. For missing datasource secret:

    Recreate the dmm-datasource-<id> secret in the platform namespace.

Required Permissions:

Platform admin access (kubectl exec on dmm-redis-ha, MongoDB read on model-monitor.dataset_jobs, K8s secret read in the platform namespace).