domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health module
Pre-upgrade RabbitMQ stream health check.
Two complementary surfaces are queried for the same streams: the Management
API (authoritative for per-stream state / members / online /
leader) and Prometheus (Raft-internal coordinator and per-stream membership
metrics, which can expose cases where the Management API serves a stale cache).
A silent-hang upgrade incident was reproduced when these two surfaces disagreed
about the leader, so the check queries both and requires that they agree.
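The agreement requirement described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name `find_leader_disagreements` and both input shapes (stream name mapped to the Management API `leader` value, and stream name mapped to whether Prometheus reports a leader) are assumptions made for the example.

```python
from typing import Dict, List, Optional


def find_leader_disagreements(
    mgmt_leaders: Dict[str, Optional[str]],
    prom_leader_present: Dict[str, bool],
) -> List[str]:
    """Return the streams on which the two surfaces disagree about a leader.

    Both argument shapes are hypothetical and chosen for illustration only.
    """
    disagreements = []
    # Consider every stream that either surface knows about.
    for stream in sorted(set(mgmt_leaders) | set(prom_leader_present)):
        # A missing or empty leader value counts as "no leader".
        mgmt_has_leader = bool(mgmt_leaders.get(stream))
        # A stream absent from Prometheus counts as "no leader present".
        prom_has_leader = prom_leader_present.get(stream, False)
        if mgmt_has_leader != prom_has_leader:
            disagreements.append(stream)
    return disagreements
```

An empty result means the two surfaces agree for every stream; any returned name is a candidate for the stale-cache situation described above.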
- domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.rabbitmq_stream_data(k8s_client)
Per-stream rows from the RabbitMQ Management API.
- Return type:
- domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.rabbitmq_stream_prometheus_data(prometheus_client_v2)
Three stream-health DataFrames from Prometheus: coordinator, leader presence, segments/lag.
- domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.test_known_streams_present(rabbitmq_stream_data, runner)
- Description:
Asserts that the two Nucleus-consumed streams (data-plane.resources and workload_status) are present in the RabbitMQ broker. Distinct failure surface from test_rabbitmq_stream_health so the operator can tell “stream missing: topology was reset, Nucleus will redeclare” apart from “stream present but unhealthy”.
- Failure Conditions:
One or both of the expected streams is absent from /api/queues.
- Troubleshooting Steps:
Confirm RabbitMQ pods are Running: kubectl -n <platform-ns> get pods -l app=rabbitmq-ha.
Restart the Nucleus dispatcher (kubectl -n <platform-ns> rollout restart deploy/nucleus-dispatcher) to trigger stream redeclaration if topology was reset intentionally.
- Resolution Steps:
If the streams should already exist (no recent PVC reset), reset and re-form the RabbitMQ stream PVCs.
If the streams legitimately do not exist yet (fresh install or post-reset), starting Nucleus will declare them; this check is expected to pass on the next run.
- Required Permissions:
Platform admin access to read the RabbitMQ admin secret and to restart Nucleus dispatcher.
- See also:
test_rabbitmq_stream_health: companion check in this file that validates per-stream leader and Management-API health once the streams exist.
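The presence check for test_known_streams_present can be sketched as below. This is an illustration under stated assumptions rather than the check's actual code: the helper name `missing_streams` is hypothetical, and the rows are assumed to be dictionaries carrying a `name` key, as in `/api/queues` responses.

```python
from typing import Iterable, List, Mapping, Sequence

# The two Nucleus-consumed streams named in the description above.
EXPECTED_STREAMS = ("data-plane.resources", "workload_status")


def missing_streams(
    queue_rows: Iterable[Mapping[str, object]],
    expected: Sequence[str] = EXPECTED_STREAMS,
) -> List[str]:
    """Return the expected stream names absent from the given rows.

    The row shape (a mapping with a "name" key) is an assumption.
    """
    present = {row.get("name") for row in queue_rows}
    # Preserve the order of `expected` so the report is stable.
    return [name for name in expected if name not in present]
```

A non-empty return value corresponds to the failure condition above: one or both expected streams are absent.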
- domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.test_rabbitmq_stream_health(rabbitmq_stream_data, rabbitmq_stream_prometheus_data, runner)
- Description:
Pre-upgrade gate for RabbitMQ stream health. Validates that every declared stream has a healthy state on the Management API, that the stream coordinator is converging, and that Prometheus agrees a leader is elected per stream. Surfaces the silent-hang precursor (a stream whose members set has drifted from online) before an upgrade that crosses the RabbitMQ 5.11.x → 6.1.x boundary.
- Failure Conditions:
A stream is in any state other than running.
A stream has no leader elected (Management API).
A stream’s members set is not a subset of online.
Stream coordinator apply lag exceeds 10 entries.
Stream coordinator commit latency exceeds 500ms.
Prometheus reports a stream with no leader present.
- Troubleshooting Steps:
From the Domino UI / kubectl, confirm the RabbitMQ pods (rabbitmq-ha-*) are Running and Ready: kubectl -n <platform-ns> get pods -l app=rabbitmq-ha.
Inspect the Management UI /api/queues?type=stream for the failed stream’s state / members / online fields, or query the same via kubectl port-forward svc/rabbitmq-ha 15672 and the admin UI.
If the Management API and Prometheus disagree on a stream’s leader, treat the Management API as stale and trust the Prometheus signal.
- Resolution Steps:
Reset (wipe and re-form) the RabbitMQ stream PVCs before retrying the upgrade. Nucleus will redeclare both data-plane.resources and workload_status on startup.
After PVC reset, re-run this check to confirm both streams come back healthy on both surfaces before unblocking the upgrade.
- Required Permissions:
Platform admin access to read the RabbitMQ admin secret in the platform namespace and to kubectl exec into the broker pod for the PVC reset.
- See also:
../test_rabbitmq.py::test_rabbitmq_dead_queues: runtime sibling that flags any queue (including streams) not in running state.
../test_rabbitmq.py::test_rabbitmq_node_status: broker-level health; failing here usually means stream checks also fail.
../test_rabbitmq.py::test_rabbitmq_feature_flags: direct neighbor for upgrade gating.
test_mongodb.py::test_mongodb_replicaset: sibling pre-upgrade check for a different stateful clustered system; same validate-before-restart pattern.
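The failure conditions listed for test_rabbitmq_stream_health can be expressed as simple predicates. The sketch below is illustrative only and is not taken from the module: the function names, the row shape (a mapping with `state`, `leader`, `members`, and `online` keys), and the idea of passing coordinator values as plain numbers are all assumptions. Only the thresholds (10 entries, 500 ms) come from the text above.

```python
from typing import List, Mapping

MAX_APPLY_LAG = 10            # entries, per the failure conditions above
MAX_COMMIT_LATENCY_MS = 500   # milliseconds, per the failure conditions above


def stream_failures(row: Mapping[str, object]) -> List[str]:
    """Evaluate one Management API row (assumed shape) against the conditions."""
    failures = []
    if row.get("state") != "running":
        failures.append("state is not running")
    if not row.get("leader"):
        failures.append("no leader elected")
    members = set(row.get("members") or [])
    online = set(row.get("online") or [])
    # Every declared member must also be online.
    if not members <= online:
        failures.append("members is not a subset of online")
    return failures


def coordinator_failures(apply_lag: float, commit_latency_ms: float) -> List[str]:
    """Evaluate coordinator values (assumed to be pre-extracted numbers)."""
    failures = []
    if apply_lag > MAX_APPLY_LAG:
        failures.append("apply lag exceeds 10 entries")
    if commit_latency_ms > MAX_COMMIT_LATENCY_MS:
        failures.append("commit latency exceeds 500ms")
    return failures
```

Both thresholds are strict, so values exactly at the limit pass, matching the wording "exceeds" in the conditions.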