domino_admin_toolkit.checks.test_dmm_redis_queues module

DMM Redis Queue Health Checks

Data source: Redis (dmm-redis-ha instance) — LLEN and LRANGE on the pending_jobs and running_jobs lists managed by dmm-plier and dmm-compute.

Question answered: “Is the DMM job queue draining, and are there stale running-locks from crashed compute pods?”

What this check does NOT cover:
  • Per-job stuck conditions or age — see test_dmm_ingestion_jobs_status.

  • DMM pod readiness — see test_dmm_pods_list.

  • Spark executor state — see info/test_dmm_spark.

Two checks live here, sharing a single Redis fetch per pytest session:

  • test_dmm_redis_pending_jobs — WARN at depth > 10, FAIL at depth > 30. A growing pending queue means compute can’t keep up or is wedged.

  • test_dmm_redis_running_jobs — WARN at depth > 1. dmm-compute is single-threaded, so any depth above 1 means at least one crashed compute pod left a stale lock behind that will block new jobs from starting.

Thresholds are inherited from the earlier analyzer in RE-3125. Cluster-wide queue depth did not belong in that per-job analyzer, so the check lives here instead.

pydantic model domino_admin_toolkit.checks.test_dmm_redis_queues.DmmPendingQueueDepthAnalyzer

Bases: AnalyzerBase[QueueDepthRow]

WARN at depth > QUEUE_DEPTH_WARN, FAIL at depth > QUEUE_DEPTH_FAIL.

Fields:
field fail_threshold: int = 30

FAIL when pending depth exceeds this.

field warn_threshold: int = 10

WARN when pending depth exceeds this.

analyze(data)

Analyzes one row and returns a list of CheckResult instances.

Return type:

list[CheckResult]

Args:

data: One row dict (TRow). The Runner calls this once per DataFrame row.

Returns:

List[CheckResult]: A list containing the results of the analysis.

Raises:

NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'DmmPendingQueueDepthAnalyzer'
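
The threshold rule documented for DmmPendingQueueDepthAnalyzer can be illustrated with a minimal standalone sketch. This is not the toolkit's implementation: the Status enum and the classify_pending_depth function are hypothetical names introduced here, and only the default values (10 and 30) and the strict "greater than" comparisons come from the field descriptions above.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for the status carried by a CheckResult."""
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


def classify_pending_depth(
    depth: int, warn_threshold: int = 10, fail_threshold: int = 30
) -> Status:
    """Map a pending_jobs depth to a status.

    Both comparisons are strict, so a depth exactly equal to a
    threshold stays in the lower band.
    """
    if depth > fail_threshold:
        return Status.FAIL
    if depth > warn_threshold:
        return Status.WARN
    return Status.PASS
```

Under these assumptions a depth of 10 still passes and a depth of 30 is still only a warning; the bands change at 11 and 31.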
pydantic model domino_admin_toolkit.checks.test_dmm_redis_queues.DmmRunningQueueAnalyzer

Bases: AnalyzerBase[QueueDepthRow]

WARN at depth > RUNNING_QUEUE_EXPECTED_MAX (stale locks).

Fields:
field expected_max: int = 1

Expected ceiling on running_jobs depth. dmm-compute is single-threaded.

analyze(data)

Analyzes one row and returns a list of CheckResult instances.

Return type:

list[CheckResult]

Args:

data: One row dict (TRow). The Runner calls this once per DataFrame row.

Returns:

List[CheckResult]: A list containing the results of the analysis.

Raises:

NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'DmmRunningQueueAnalyzer'
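
As a rough illustration of the running-queue rule, the sketch below takes one row and returns a warning message when the depth exceeds the expected ceiling. The function name running_queue_warning and the message wording are assumptions made for this example; the real analyzer returns CheckResult instances, whose structure is not shown in this reference.

```python
from typing import Optional, TypedDict


class QueueDepthRow(TypedDict):
    """Row shape as documented in this module: queue name plus depth."""
    queue: str
    depth: int


def running_queue_warning(
    row: QueueDepthRow, expected_max: int = 1
) -> Optional[str]:
    """Return a warning string if depth > expected_max, else None."""
    extra = row["depth"] - expected_max
    if extra <= 0:
        return None
    # Entries above the ceiling are candidates for stale locks; this
    # sketch cannot tell which specific entries are stale.
    return (
        f"{row['queue']} depth {row['depth']} > {expected_max}; "
        f"{extra} entries above the expected ceiling may be stale locks"
    )
```

Returning None for the healthy case keeps the example small; it says nothing about how the toolkit represents a PASS.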
class domino_admin_toolkit.checks.test_dmm_redis_queues.QueueDepthRow

Bases: TypedDict

Per-row shape passed to the queue analyzers.

depth: int
queue: str
domino_admin_toolkit.checks.test_dmm_redis_queues.dmm_redis_queue_depths(k8s_client)

Single Redis hit per session: LLEN on both DMM queues. Returns a two-row DataFrame (queue ∈ {pending_jobs, running_jobs}, depth) so each test can filter the slice its analyzer reasons over.

Skip semantics mirror the rest of the DMM checks: if Redis is unreachable (RedisError or a low-level connection error), the test is skipped. Anything more surprising (auth, protocol) propagates as a test ERROR. The fixture does NOT return an empty DataFrame on failure, because that would present as a misleading PASS-with-no-data via Runner’s on_empty path.
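
The fetch-once, filter-per-test pattern described above might look roughly like the following sketch. It is deliberately simplified and makes several assumptions: it returns a plain list of row dicts rather than a DataFrame, it accepts any client object exposing an llen(name) method (as redis-py clients do), and it raises a hypothetical QueueSourceUnavailable exception where the real fixture would skip the test. The names fetch_queue_depths and rows_for are also invented for illustration.

```python
from typing import List, Protocol, Tuple, Type, TypedDict

QUEUES = ("pending_jobs", "running_jobs")


class QueueDepthRow(TypedDict):
    queue: str
    depth: int


class SupportsLlen(Protocol):
    """Anything with a Redis-style LLEN call."""
    def llen(self, name: str) -> int: ...


class QueueSourceUnavailable(Exception):
    """Hypothetical signal that a caller would turn into a test skip."""


def fetch_queue_depths(
    client: SupportsLlen,
    unreachable_errors: Tuple[Type[BaseException], ...] = (
        ConnectionError,
        TimeoutError,
    ),
) -> List[QueueDepthRow]:
    """Read LLEN for both queues and return one row per queue.

    Only the listed "unreachable" errors are converted; any other
    exception propagates unchanged. No empty result is ever returned
    on failure.
    """
    rows: List[QueueDepthRow] = []
    try:
        for queue in QUEUES:
            rows.append({"queue": queue, "depth": int(client.llen(queue))})
    except unreachable_errors as exc:
        raise QueueSourceUnavailable(str(exc)) from exc
    return rows


def rows_for(rows: List[QueueDepthRow], queue: str) -> List[QueueDepthRow]:
    """Select the rows a single test's analyzer should see."""
    return [row for row in rows if row["queue"] == queue]
```

Raising instead of returning an empty list reflects the design point made above: an empty result could be mistaken for a healthy queue, whereas an explicit signal forces the caller to decide between skip and error.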

domino_admin_toolkit.checks.test_dmm_redis_queues.test_dmm_redis_pending_jobs(dmm_redis_queue_depths, runner)
Description:

Reports the depth of the DMM pending_jobs Redis list.

Result:

  • PASS: depth ≤ 10.

  • WARN: 10 < depth ≤ 30. Queue may be backing up; investigate compute health.

  • FAIL: depth > 30. Queue is growing without draining.

  • SKIP: DMM Redis unavailable.

Thresholds:
  • WARN: depth > 10

  • FAIL: depth > 30

Required Permissions:

Platform admin (kubectl exec on dmm-redis-ha for manual LRANGE).

domino_admin_toolkit.checks.test_dmm_redis_queues.test_dmm_redis_running_jobs(dmm_redis_queue_depths, runner)
Description:

Reports the depth of the DMM running_jobs Redis list. Because dmm-compute is single-threaded, this list should hold at most one entry at a time. Anything above 1 means a previous compute pod crashed mid-job and left a stale lock behind.

Result:

  • PASS: depth ≤ 1.

  • WARN: depth > 1. Stale locks present; investigate crashed compute pods.

  • SKIP: DMM Redis unavailable.

Thresholds:
  • WARN: depth > 1

Required Permissions:

Platform admin (kubectl exec on dmm-redis-ha for manual LRANGE / DEL).