domino_admin_toolkit.checks.test_domino_service_health module

pydantic model domino_admin_toolkit.checks.test_domino_service_health.ServiceHealthAnalyzer

Bases: AnalyzerBase

Analyzes the health status of Domino services as reported by the nucleus-frontend health endpoint.

Evaluates:
  • Individual service health status from nucleus-frontend health endpoint

  • Service status mapping (green=pass, red/critical=fail, yellow/warning=warn)

  • Error handling for malformed or missing health data

analyze(data)

Analyze health data from the nucleus-frontend health endpoint.

Return type:

list[CheckResult]

Args:
data: DataFrame containing parsed health data with columns:
  • name: Service name

  • status: Health status (typically “green”, “red”, “yellow”)

  • message: Optional status message

  • type: Type of check (e.g., “Health”)

Returns:

List of CheckResult objects for each service
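
The status mapping listed under "Evaluates" can be illustrated with a minimal sketch. It is not the toolkit's implementation: `SketchResult` is a hypothetical stand-in for `CheckResult`, `classify_rows` is an invented helper name, and treating a missing or unrecognized status as a failure is an assumption. Rows are plain dicts here (for a DataFrame, `data.to_dict("records")` yields the same shape) so the example needs only the standard library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Mapping

# Mapping taken from the "Evaluates" list: green=pass,
# red/critical=fail, yellow/warning=warn.
STATUS_TO_OUTCOME = {
    "green": "pass",
    "red": "fail",
    "critical": "fail",
    "yellow": "warn",
    "warning": "warn",
}


@dataclass
class SketchResult:
    """Hypothetical stand-in for the toolkit's CheckResult."""

    service: str
    outcome: str
    message: str = ""


def classify_rows(rows: Iterable[Mapping[str, object]]) -> list[SketchResult]:
    """Return one result per service row (hypothetical helper)."""
    results = []
    for row in rows:
        name = str(row.get("name") or "<unknown>")
        status = row.get("status")
        message = str(row.get("message") or "")
        if not isinstance(status, str):
            # Assumption: a row without a usable status counts as a failure.
            results.append(SketchResult(name, "fail", "missing or malformed status"))
            continue
        outcome = STATUS_TO_OUTCOME.get(status.strip().lower())
        if outcome is None:
            # Assumption: unknown status values also count as failures.
            results.append(SketchResult(name, "fail", f"unrecognized status: {status!r}"))
        else:
            results.append(SketchResult(name, outcome, message))
    return results
```

Matching is case-insensitive in this sketch, so "RED" and "red" are treated alike; whether the real analyzer normalizes case is not stated in this documentation.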

name: ClassVar[str] = 'ServiceHealthAnalyzer'

domino_admin_toolkit.checks.test_domino_service_health.service_health_data()

Collect service health data from nucleus-frontend
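
As a rough sketch of the parsing step, the function below normalizes an already-decoded health payload into rows with the four columns that `analyze(data)` expects (name, status, message, type). The payload shape (a list of objects carrying those keys) is an assumption; this documentation does not describe the actual response format. `payload_to_rows` and `COLUMNS` are hypothetical names.

```python
from __future__ import annotations

from typing import Any

# Column names taken from the analyze(data) documentation.
COLUMNS = ("name", "status", "message", "type")


def payload_to_rows(payload: Any) -> list[dict[str, Any]]:
    """Normalize a decoded health payload into uniform rows.

    Assumes the payload is a list of dicts; the real response
    format is not documented here.
    """
    if not isinstance(payload, list):
        raise ValueError("health payload is not a list of service entries")
    rows = []
    for entry in payload:
        if not isinstance(entry, dict):
            # Assumption: skip entries that are not objects.
            continue
        # Keep only the expected columns; absent keys become None.
        rows.append({col: entry.get(col) for col in COLUMNS})
    return rows
```

The resulting rows can be passed to `pandas.DataFrame(rows, columns=list(COLUMNS))` to obtain a frame with the documented columns, including when `rows` is empty.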

domino_admin_toolkit.checks.test_domino_service_health.test_domino_service_health(domino_version_string, service_health_data)

Validates the health status of core Domino platform services via the nucleus-frontend health endpoint.

This check queries the nucleus-frontend health endpoint to verify the operational status of:

  • Redis: Short-term caching

  • Vault: Secrets management

  • MongoDB: Database connectivity and data layer

  • Kubernetes: Cluster communication and resource management

  • RabbitMQ: Message queuing and inter-service communication

  • Platform services rollout status: nucleus-dispatcher, nucleus-develop, nucleus-train, model-hosting, pham-model-serving-service, nucleus-frontend

Failure Conditions:
  • nucleus-frontend health endpoint unreachable or returning errors

  • Individual services reporting unhealthy status (red/critical) in health response

  • Critical service dependencies showing degraded performance (yellow/warning)

  • Platform service rollouts showing deployment issues

Troubleshooting Steps:
  1. Verify nucleus-frontend pod is running and accessible via service discovery

  2. Check individual service logs for error messages and root causes:
     • Redis connection and performance issues
     • Vault authentication and secret access problems
     • MongoDB connectivity and query performance
     • Kubernetes API server communication issues
     • RabbitMQ message queue processing delays

  3. Review platform service rollout status for deployment failures

  4. Validate network policies and service mesh configuration

Resolution Steps:
  1. For nucleus-frontend connectivity issues:
     • Restart nucleus-frontend pods if unresponsive
     • Check service discovery and load balancer configuration
     • Verify network policies allow health endpoint access

  2. For individual service failures:
     • Review specific service logs and restart unhealthy pods
     • Check resource constraints and scaling configuration
     • Validate service dependencies and integration endpoints

  3. For platform rollout issues:
     • Check deployment status and pod readiness
     • Review resource availability and scheduling constraints
     • Validate service configurations and environment variables