domino_admin_toolkit.checks.test_dataset_lifecycle_status module

Dataset Lifecycle Status Health Check

Monitors MongoDB (datasetrw, datasetrw_snapshot) for datasets and snapshots stuck in deletion lifecycle states: MarkedForDeletion, DeletionInProgress, or Failed.

Stuck deletions are a common symptom of filetask callback failures, stuck K8s jobs, or dataset-rw service errors. This check surfaces those items automatically and provides actionable next steps for administrators.

Relationship to test_filetask_queue_status:
  • This check (dataset lifecycle): MongoDB — “Are datasets stuck in a deletion state?”

  • Filetask queue check: PostgreSQL — “Are filetask jobs stuck or failing?”

Together they give both the symptom (dataset stuck) and the cause (filetask job stuck).

pydantic model domino_admin_toolkit.checks.test_dataset_lifecycle_status.DatasetLifecycleAnalyzer

Bases: AnalyzerBase

Analyzes dataset and snapshot lifecycle states for stuck deletion conditions.

Checks whether datasets/snapshots have been in DeletionInProgress, MarkedForDeletion, or Failed states for longer than acceptable thresholds.

Fields:
field dip_fail_minutes: float = 60

Age threshold (minutes) for DeletionInProgress items to trigger FAIL

field mfd_warn_minutes: float = 1440

Age threshold (minutes) for MarkedForDeletion items to trigger WARN

analyze(data)

Analyze lifecycle summary metrics for stuck deletion conditions.

Return type:

list[CheckResult]

Args:

data: Single-row dict from the lifecycle summary DataFrame

Returns:

list[CheckResult]: Results with PASS/WARN/FAIL status and actionable messages

name: ClassVar[str] = 'DatasetLifecycleAnalyzer'
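The documented thresholds reduce to a simple per-item classification rule. The helper below is an illustrative sketch of that rule, not the toolkit's actual analyze() implementation; the constants mirror the documented field defaults:

```python
# Illustrative sketch of the documented threshold logic -- not the
# toolkit's actual DatasetLifecycleAnalyzer.analyze() implementation.
DIP_FAIL_MINUTES = 60     # DeletionInProgress -> FAIL beyond this age
MFD_WARN_MINUTES = 1440   # MarkedForDeletion -> WARN beyond this age (24 h)

def classify(lifecycle_status: str, age_minutes: float) -> str:
    """Map a stuck item's state and age to a PASS/WARN/FAIL status."""
    if lifecycle_status == "Failed":
        return "FAIL"  # Failed always fails, regardless of age
    if lifecycle_status == "DeletionInProgress" and age_minutes > DIP_FAIL_MINUTES:
        return "FAIL"
    if lifecycle_status == "MarkedForDeletion" and age_minutes > MFD_WARN_MINUTES:
        return "WARN"
    return "PASS"
```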
domino_admin_toolkit.checks.test_dataset_lifecycle_status.dataset_lifecycle_data(mongo_client)

Collect stuck dataset/snapshot lifecycle data from MongoDB.

domino_admin_toolkit.checks.test_dataset_lifecycle_status.dataset_lifecycle_summary(dataset_lifecycle_data)

Generate aggregated summary statistics from stuck lifecycle data.
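The exact summary columns are not documented here; the sketch below only illustrates the kind of aggregation involved, assuming each stuck record is a dict with hypothetical lifecycle_status and age_minutes keys:

```python
from collections import defaultdict

# Hypothetical aggregation sketch; real column names in the lifecycle
# summary DataFrame may differ.
def summarize(stuck_items):
    """Per-state count and oldest age, computed from stuck lifecycle records."""
    summary = defaultdict(lambda: {"count": 0, "max_age_minutes": 0.0})
    for item in stuck_items:
        entry = summary[item["lifecycle_status"]]
        entry["count"] += 1
        entry["max_age_minutes"] = max(entry["max_age_minutes"], item["age_minutes"])
    return dict(summary)
```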

domino_admin_toolkit.checks.test_dataset_lifecycle_status.test_dataset_lifecycle_status(dataset_lifecycle_data, dataset_lifecycle_summary)
Description:

Checks MongoDB (datasetrw, datasetrw_snapshot) for datasets and snapshots stuck in deletion lifecycle states: MarkedForDeletion, DeletionInProgress, or Failed. Stuck deletions are a common symptom of filetask callback failures, stuck filetask K8s jobs, or dataset-rw service errors.

The _id column in the results table is the same value stored in the filetask PostgreSQL tasks.key column. Use it to correlate stuck datasets with their corresponding filetask jobs:

SELECT * FROM tasks WHERE key = '<_id>';

See also: test_filetask_queue_status — checks the filetask processor side to determine whether the underlying file deletion jobs are stuck or have failed.

Result:

PASS: No datasets or snapshots are stuck in deletion states.

WARN: Datasets/snapshots have been in MarkedForDeletion longer than 24 hours.

FAIL: Datasets/snapshots have been in DeletionInProgress longer than 60 minutes, or items are in a Failed state.

SKIP: MongoDB is unavailable.

Thresholds:
  • DeletionInProgress FAIL threshold: 60 minutes

  • MarkedForDeletion WARN threshold: 1440 minutes (24 hours)

  • Failed state: always FAIL (any count)

Failure Conditions:
  • Dataset or snapshot has been DeletionInProgress for > 60 minutes, indicating the filetask job did not complete or the callback failed.

  • Dataset or snapshot is in Failed state, indicating a hard error in the deletion pipeline requiring manual intervention.

  • Dataset or snapshot has been MarkedForDeletion for > 24 hours without transitioning to DeletionInProgress.

Troubleshooting Steps:
  1. Identify stuck items in the table below (note the _id and lifecycle_status).

  2. For DeletionInProgress — check the filetask queue:

    Run: test_filetask_queue_status
    Or: SELECT * FROM tasks WHERE key = '<_id>' AND status NOT IN ('Completed', 'Failed');

  3. Check dataset-rw service logs for errors:

    kubectl logs -n <platform_namespace> -l app=dataset-rw --tail=200 | grep -i error

  4. Check for stuck/failed K8s jobs from filetask:

    kubectl get jobs -n <compute_namespace> | grep dataset

  5. For MarkedForDeletion — verify dataset-rw is running:

    kubectl get pods -n <platform_namespace> -l app=dataset-rw
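For repeated investigations, the kubectl commands above can be assembled programmatically per environment. A minimal sketch; the namespace arguments are placeholders you must substitute with your own values:

```python
# Build the troubleshooting kubectl commands for a given environment.
# Namespace values are placeholders; substitute your cluster's own.
def troubleshooting_commands(platform_ns: str, compute_ns: str) -> list[str]:
    return [
        # Step 3: dataset-rw service errors
        f"kubectl logs -n {platform_ns} -l app=dataset-rw --tail=200 | grep -i error",
        # Step 4: stuck/failed K8s jobs from filetask
        f"kubectl get jobs -n {compute_ns} | grep dataset",
        # Step 5: verify dataset-rw is running
        f"kubectl get pods -n {platform_ns} -l app=dataset-rw",
    ]
```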

Resolution Steps:
  1. If the filetask job is stuck (per test_filetask_queue_status or SQL query):

    kubectl logs -n <platform_namespace> -l app=filetask-service --tail=200
    kubectl rollout restart deployment/filetask-service -n <platform_namespace>

  2. If dataset-rw did not trigger deletion (stuck in MarkedForDeletion):

    kubectl rollout restart deployment/dataset-rw -n <platform_namespace>

  3. If a dataset is in Failed state and cannot self-recover:

    Contact Domino support with the _id values for manual MongoDB remediation.
    Ref: https://support.domino.ai/support/s/article/Is-filetask-stuck-datasets-admin-page-showsDeletionsInProgress

Required Permissions:

Platform admin access (kubectl, MongoDB read, PostgreSQL read)