domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health module

Pre-upgrade RabbitMQ stream health check.

Two complementary surfaces are queried for the same streams: the Management API (authoritative for per-stream state, members, online, and leader) and Prometheus (Raft-internal coordinator metrics and per-stream membership, which the Management API can misreport when its cache is stale). A silent-hang upgrade incident was reproduced when the two surfaces disagreed about the leader, so the check queries both and requires that they agree.
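The agreement requirement can be illustrated with a minimal sketch. It assumes the Management API view has been reduced to a mapping of stream name to leader node (or None) and the Prometheus view to a mapping of stream name to a leader-present flag. The function name, both input shapes, and the node names in the test are hypothetical and are not taken from the toolkit.

```python
from typing import Optional


def leader_disagreements(
    mgmt_leaders: dict[str, Optional[str]],
    prom_leader_present: dict[str, bool],
) -> list[str]:
    """Return the streams on which the two surfaces disagree about a leader.

    A stream that appears on only one surface is treated as having no
    leader on the other surface (an assumption made for this sketch).
    """
    disagreements = []
    for stream in sorted(set(mgmt_leaders) | set(prom_leader_present)):
        mgmt_has_leader = bool(mgmt_leaders.get(stream))
        prom_has_leader = prom_leader_present.get(stream, False)
        if mgmt_has_leader != prom_has_leader:
            disagreements.append(stream)
    return disagreements
```

An empty result means both surfaces tell the same story for every stream; any name in the result is a candidate for the stale-cache situation described above.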

domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.rabbitmq_stream_data(k8s_client)

Per-stream rows from the RabbitMQ Management API.

Return type:

DataFrame
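As a rough illustration of what "per-stream rows" might look like, the sketch below flattens a Management API queue listing into one plain dict per stream. It assumes each entry carries the name, type, state, leader, members, and online fields referred to elsewhere in this module's documentation; the function name is hypothetical and the real fixture may build its DataFrame differently.

```python
import json


def stream_rows(queues_payload: str) -> list[dict]:
    """Flatten a JSON queue listing into one row per stream.

    Entries whose type is not "stream" are skipped. Missing members or
    online lists are normalised to empty lists.
    """
    rows = []
    for queue in json.loads(queues_payload):
        if queue.get("type") != "stream":
            continue
        rows.append(
            {
                "name": queue.get("name"),
                "state": queue.get("state"),
                "leader": queue.get("leader"),
                "members": sorted(queue.get("members") or []),
                "online": sorted(queue.get("online") or []),
            }
        )
    return rows
```

A list of dicts in this shape can be passed directly to pandas.DataFrame to obtain a tabular view.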

domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.rabbitmq_stream_prometheus_data(prometheus_client_v2)

Three stream-health DataFrames from Prometheus: coordinator, leader presence, segments/lag.

Return type:

dict[str, DataFrame]

domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.test_known_streams_present(rabbitmq_stream_data, runner)
Description:

Asserts that the two Nucleus-consumed streams (data-plane.resources and workload_status) are present in the RabbitMQ broker. This is a separate failure surface from test_rabbitmq_stream_health, so the operator can tell “stream missing (topology was reset, Nucleus will redeclare)” apart from “stream present but unhealthy”.

Failure Conditions:
  • One or both of the expected streams are absent from the Management API /api/queues response.
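The presence condition above amounts to a set difference. A minimal sketch, assuming the stream names have already been extracted from the /api/queues response (the constant and function names are hypothetical):

```python
# The two Nucleus-consumed streams named in the description above.
EXPECTED_STREAMS = ("data-plane.resources", "workload_status")


def missing_streams(present_names) -> list[str]:
    """Return the expected streams that do not appear in present_names."""
    present = set(present_names)
    return [name for name in EXPECTED_STREAMS if name not in present]
```

The check would fail whenever this list is non-empty, and the list itself tells the operator which stream to look for.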

Troubleshooting Steps:
  1. Confirm RabbitMQ pods are Running: kubectl -n <platform-ns> get pods -l app=rabbitmq-ha.

  2. Restart the Nucleus dispatcher (kubectl -n <platform-ns> rollout restart deploy/nucleus-dispatcher) to trigger stream redeclaration if topology was reset intentionally.

Resolution Steps:
  1. If the streams should already exist (no recent PVC reset), reset and re-form the RabbitMQ stream PVCs.

  2. If the streams legitimately do not exist yet (fresh install or post-reset), starting Nucleus will declare them; this check is expected to pass on the next run.

Required Permissions:
  • Platform admin access to read the RabbitMQ admin secret and to restart Nucleus dispatcher.

See also:
  • test_rabbitmq_stream_health — companion check in this file that validates per-stream leader and Management-API health once the streams exist.

domino_admin_toolkit.checks.pre_upgrade.test_rabbitmq_stream_health.test_rabbitmq_stream_health(rabbitmq_stream_data, rabbitmq_stream_prometheus_data, runner)
Description:

Pre-upgrade gate for RabbitMQ stream health. Validates that every declared stream has a healthy state on the Management API, that the stream coordinator is converging, and that Prometheus agrees a leader is elected per stream. Surfaces the silent-hang precursor (a stream whose members set has drifted from online) before an upgrade that crosses the RabbitMQ 5.11.x → 6.1.x boundary.

Failure Conditions:
  • A stream is in any state other than running.

  • A stream has no leader elected (Management API).

  • A stream’s members set is not a subset of online.

  • Stream coordinator apply lag exceeds 10 entries.

  • Stream coordinator commit latency exceeds 500ms.

  • Prometheus reports a stream with no leader present.
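The failure conditions above can be sketched as two small predicates. This is an illustration under stated assumptions rather than the toolkit's implementation: the row is assumed to be a dict with state, leader, members, and online keys, the coordinator metrics are assumed to arrive as plain numbers (entries and milliseconds), and all function and constant names are hypothetical.

```python
# Thresholds taken from the failure conditions listed above.
APPLY_LAG_MAX_ENTRIES = 10
COMMIT_LATENCY_MAX_MS = 500


def stream_failures(row: dict, prom_leader_present: bool) -> list[str]:
    """Evaluate the per-stream conditions for one Management API row."""
    failures = []
    if row.get("state") != "running":
        failures.append("state is not running")
    if not row.get("leader"):
        failures.append("no leader elected (Management API)")
    # Every member must also be online; extra online nodes are acceptable.
    if not set(row.get("members") or []) <= set(row.get("online") or []):
        failures.append("members is not a subset of online")
    if not prom_leader_present:
        failures.append("no leader present (Prometheus)")
    return failures


def coordinator_failures(apply_lag_entries: float, commit_latency_ms: float) -> list[str]:
    """Evaluate the stream coordinator conditions (strictly greater than)."""
    failures = []
    if apply_lag_entries > APPLY_LAG_MAX_ENTRIES:
        failures.append("coordinator apply lag exceeds 10 entries")
    if commit_latency_ms > COMMIT_LATENCY_MAX_MS:
        failures.append("coordinator commit latency exceeds 500 ms")
    return failures
```

Returning every violated condition, instead of stopping at the first, lets a single run report the full picture for a stream.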

Troubleshooting Steps:
  1. Using the Domino UI or kubectl, confirm the RabbitMQ pods (rabbitmq-ha-*) are Running and Ready: kubectl -n <platform-ns> get pods -l app=rabbitmq-ha.

  2. Inspect the Management API endpoint /api/queues?type=stream for the failed stream’s state, members, and online fields. To reach it, run kubectl -n <platform-ns> port-forward svc/rabbitmq-ha 15672, then query the API or open the admin UI.

  3. If the Management API and Prometheus disagree on a stream’s leader, treat the Management API as stale and trust the Prometheus signal.
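For step 2, the Management API can also be queried from a short script once the port-forward is running. The sketch below only builds the request; it assumes the API is reachable at http://localhost:15672 and accepts HTTP Basic authentication with the credentials from the RabbitMQ admin secret. The function name and the placeholder credentials are hypothetical.

```python
import base64
import urllib.request


def build_stream_queues_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the stream listing, with Basic auth attached."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(f"{base_url.rstrip('/')}/api/queues?type=stream")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Usage (requires the port-forward to be running; not executed here):
#   req = build_stream_queues_request("http://localhost:15672", "<user>", "<password>")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       body = resp.read().decode()
```

Keeping request construction separate from the network call makes the URL and header easy to inspect before anything is sent to the broker.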

Resolution Steps:
  1. Reset (wipe and re-form) the RabbitMQ stream PVCs before retrying the upgrade. Nucleus will redeclare both data-plane.resources and workload_status on startup.

  2. After PVC reset, re-run this check to confirm both streams come back healthy on both surfaces before unblocking the upgrade.

Required Permissions:
  • Platform admin access to read the RabbitMQ admin secret in the platform namespace and to kubectl exec into the broker pod for the PVC reset.

See also:
  • ../test_rabbitmq.py::test_rabbitmq_dead_queues — runtime sibling that flags any queue (including streams) not in running state.

  • ../test_rabbitmq.py::test_rabbitmq_node_status — broker-level health; failing here usually means stream checks also fail.

  • ../test_rabbitmq.py::test_rabbitmq_feature_flags — direct neighbor for upgrade gating.

  • test_mongodb.py::test_mongodb_replicaset — sibling pre-upgrade check for a different stateful clustered system; same validate-before-restart pattern.