domino_admin_toolkit.checks.test_dataset_lifecycle_status module
Dataset Lifecycle Status Health Check
Monitors MongoDB (datasetrw, datasetrw_snapshot) for datasets and snapshots stuck in deletion lifecycle states: MarkedForDeletion, DeletionInProgress, or Failed.
Stuck deletions are a common symptom of filetask callback failures, stuck K8s jobs, or dataset-rw service errors. This check surfaces those items automatically and provides actionable next steps for administrators.
- Relationship to test_filetask_queue_status:
This check (dataset lifecycle): MongoDB — “Are datasets stuck in a deletion state?”
Filetask queue check: PostgreSQL — “Are filetask jobs stuck or failing?”
Together they give both the symptom (dataset stuck) and the cause (filetask job stuck).
- pydantic model domino_admin_toolkit.checks.test_dataset_lifecycle_status.DatasetLifecycleAnalyzer
Bases: AnalyzerBase
Analyzes dataset and snapshot lifecycle states for stuck deletion conditions.
Checks whether datasets/snapshots have been in DeletionInProgress, MarkedForDeletion, or Failed states for longer than acceptable thresholds.
- field dip_fail_minutes: float = 60
Age threshold (minutes) for DeletionInProgress items to trigger FAIL
- field mfd_warn_minutes: float = 1440
Age threshold (minutes) for MarkedForDeletion items to trigger WARN
- analyze(data)
Analyze lifecycle summary metrics for stuck deletion conditions.
- Return type:
list[CheckResult]
- Args:
data: Single-row dict from the lifecycle summary DataFrame
- Returns:
list[CheckResult]: Results with PASS/WARN/FAIL status and actionable messages
- name: ClassVar[str] = 'DatasetLifecycleAnalyzer'
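The threshold logic described for DatasetLifecycleAnalyzer can be sketched roughly as follows. This is an illustration only, not the toolkit's implementation: the Finding class is a hypothetical stand-in for CheckResult, and the summary keys (failed_count, oldest_dip_minutes, oldest_mfd_minutes) are assumed names rather than the real column names of the lifecycle summary.

```python
from dataclasses import dataclass

DIP_FAIL_MINUTES = 60.0    # documented default of dip_fail_minutes
MFD_WARN_MINUTES = 1440.0  # documented default of mfd_warn_minutes


@dataclass
class Finding:
    """Hypothetical stand-in for the toolkit's CheckResult."""
    status: str  # "PASS", "WARN", or "FAIL"
    message: str


def classify(summary, dip_fail_minutes=DIP_FAIL_MINUTES,
             mfd_warn_minutes=MFD_WARN_MINUTES):
    """Apply the documented thresholds to a single-row summary dict.

    The keys read from `summary` are assumptions, not the real schema.
    """
    findings = []

    # Any item in the Failed state is always a FAIL.
    failed = summary.get("failed_count", 0)
    if failed > 0:
        findings.append(Finding("FAIL", f"{failed} item(s) in Failed state"))

    # DeletionInProgress older than the threshold is a FAIL.
    dip_age = summary.get("oldest_dip_minutes", 0.0)
    if dip_age > dip_fail_minutes:
        findings.append(Finding(
            "FAIL", f"DeletionInProgress for {dip_age:.0f} min"))

    # MarkedForDeletion older than the threshold is a WARN.
    mfd_age = summary.get("oldest_mfd_minutes", 0.0)
    if mfd_age > mfd_warn_minutes:
        findings.append(Finding(
            "WARN", f"MarkedForDeletion for {mfd_age:.0f} min"))

    if not findings:
        findings.append(Finding(
            "PASS", "No datasets or snapshots are stuck in deletion states"))
    return findings
```

In this sketch the comparisons are strict, so an item exactly at a threshold still passes; whether the real analyzer treats the boundary the same way is not stated in this page.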
- domino_admin_toolkit.checks.test_dataset_lifecycle_status.dataset_lifecycle_data(mongo_client)
Collect stuck dataset/snapshot lifecycle data from MongoDB.
- domino_admin_toolkit.checks.test_dataset_lifecycle_status.dataset_lifecycle_summary(dataset_lifecycle_data)
Generate aggregated summary statistics from stuck lifecycle data.
- domino_admin_toolkit.checks.test_dataset_lifecycle_status.test_dataset_lifecycle_status(dataset_lifecycle_data, dataset_lifecycle_summary)
- Description:
Checks MongoDB (datasetrw, datasetrw_snapshot) for datasets and snapshots stuck in deletion lifecycle states: MarkedForDeletion, DeletionInProgress, or Failed. Stuck deletions are a common symptom of filetask callback failures, stuck filetask K8s jobs, or dataset-rw service errors.
The _id column in the results table is the same value stored in the filetask PostgreSQL tasks.key column. Use it to correlate stuck datasets with their corresponding filetask jobs:
SELECT * FROM tasks WHERE key = '<_id>';
See also: test_filetask_queue_status — checks the filetask processor side to determine whether the underlying file deletion jobs are stuck or have failed.
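When several items are stuck, the same correlation can be done in one parameterized query. The sketch below is a hedged illustration: it uses an in-memory SQLite database purely as a stand-in for the filetask PostgreSQL database, the helper name find_tasks_for_ids is hypothetical, and the example keys are made up. Only the tasks table and its key and status columns come from the description above. With a PostgreSQL driver such as psycopg the placeholder would be %s rather than ?.

```python
import sqlite3


def find_tasks_for_ids(conn, ids):
    """Return (key, status) rows from `tasks` matching the given _id values."""
    ids = list(ids)
    if not ids:
        return []
    # One placeholder per id keeps the values out of the SQL text.
    placeholders = ", ".join("?" for _ in ids)
    query = f"SELECT key, status FROM tasks WHERE key IN ({placeholders}) ORDER BY key"
    return conn.execute(query, ids).fetchall()


# Stand-in database with made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (key TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("id-aaa", "Failed"), ("id-bbb", "Completed")],
)

# "id-zzz" has no matching task, which is itself a useful signal.
rows = find_tasks_for_ids(conn, ["id-aaa", "id-zzz"])
for key, status in rows:
    print(key, status)
```

An _id that returns no row suggests the deletion never reached the filetask queue, which points toward dataset-rw rather than filetask.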
- Result:
PASS: No datasets or snapshots are stuck in deletion states.
WARN: Datasets/snapshots have been in MarkedForDeletion longer than 24 hours.
FAIL: Datasets/snapshots have been in DeletionInProgress longer than 60 minutes, or items are in a Failed state.
SKIP: MongoDB is unavailable.
- Thresholds:
DeletionInProgress FAIL threshold: 60 minutes
MarkedForDeletion WARN threshold: 1440 minutes (24 hours)
Failed state: always FAIL (any count)
- Failure Conditions:
Dataset or snapshot has been in DeletionInProgress for > 60 minutes, indicating the filetask job did not complete or the callback failed.
Dataset or snapshot is in Failed state, indicating a hard error in the deletion pipeline requiring manual intervention.
Dataset or snapshot has been in MarkedForDeletion for > 24 hours without transitioning to DeletionInProgress (reported as WARN rather than FAIL).
- Troubleshooting Steps:
Identify stuck items in the table below (note the _id and lifecycle_status).
- For DeletionInProgress — check the filetask queue:
Run: test_filetask_queue_status
Or: SELECT * FROM tasks WHERE key = '<_id>' AND status NOT IN ('Completed', 'Failed');
- Check dataset-rw service logs for errors:
kubectl logs -n <platform_namespace> -l app=dataset-rw --tail=200 | grep -i error
- Check for stuck/failed K8s jobs from filetask:
kubectl get jobs -n <compute_namespace> | grep dataset
- For MarkedForDeletion — verify dataset-rw is running:
kubectl get pods -n <platform_namespace> -l app=dataset-rw
- Resolution Steps:
- If the filetask job is stuck (per test_filetask_queue_status or SQL query):
kubectl logs -n <platform_namespace> -l app=filetask-service --tail=200
kubectl rollout restart deployment/filetask-service -n <platform_namespace>
- If dataset-rw did not trigger deletion (stuck in MarkedForDeletion):
kubectl rollout restart deployment/dataset-rw -n <platform_namespace>
- If a dataset is in Failed state and cannot self-recover:
Contact Domino support with the _id values for manual MongoDB remediation.
Ref: https://support.domino.ai/support/s/article/Is-filetask-stuck-datasets-admin-page-showsDeletionsInProgress
- Required Permissions:
Platform admin access (kubectl, MongoDB read, PostgreSQL read)
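The troubleshooting and resolution steps above can be collected into a small lookup keyed by lifecycle_status, for example to print next steps alongside each stuck item. This is a sketch under assumptions: the helper name next_steps and the grouping of the general dataset-rw log and K8s job checks under DeletionInProgress are one reading of the steps, not part of the toolkit, and the namespace values passed in are placeholders.

```python
# Commands are copied from the steps above; the grouping per status is an
# assumption made for illustration.
NEXT_STEPS = {
    "DeletionInProgress": [
        "Run test_filetask_queue_status",
        "kubectl logs -n {platform_namespace} -l app=dataset-rw --tail=200 | grep -i error",
        "kubectl get jobs -n {compute_namespace} | grep dataset",
    ],
    "MarkedForDeletion": [
        "kubectl get pods -n {platform_namespace} -l app=dataset-rw",
        "kubectl rollout restart deployment/dataset-rw -n {platform_namespace}",
    ],
    "Failed": [
        "Contact Domino support with the _id values for manual MongoDB remediation",
    ],
}


def next_steps(lifecycle_status, platform_namespace, compute_namespace):
    """Return the suggested steps for a status, with namespaces filled in."""
    steps = NEXT_STEPS.get(lifecycle_status, [])
    return [
        step.format(
            platform_namespace=platform_namespace,
            compute_namespace=compute_namespace,
        )
        for step in steps
    ]


# Placeholder namespace names; substitute the deployment's actual values.
for step in next_steps("MarkedForDeletion", "my-platform-ns", "my-compute-ns"):
    print(step)
```

Unknown statuses return an empty list, so the helper does not suggest commands for states this check does not cover.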