domino_admin_toolkit.checks.test_domino_service_health module

pydantic model domino_admin_toolkit.checks.test_domino_service_health.ServiceHealthAnalyzer

Analyzes the health status of Domino services as reported by the nucleus-frontend health endpoint.

Evaluates:

analyze(data)

Analyze health data from the nucleus-frontend health endpoint.

Args:

data: DataFrame containing parsed health data with columns:

Returns:

List of CheckResult objects for each service
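
The shape of this contract (one result per service row) can be illustrated with a minimal, stdlib-only sketch. It is not the toolkit's implementation: `SketchResult` is a hypothetical stand-in for `CheckResult`, plain dictionaries stand in for DataFrame rows, and the `service` and `status` keys as well as the `green` healthy value are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SketchResult:
    """Hypothetical stand-in for the toolkit's CheckResult."""

    service: str
    passed: bool
    message: str


def analyze_rows(rows: List[Dict[str, str]]) -> List[SketchResult]:
    """Return one result per service row.

    Each row is assumed to carry a "service" name and a "status" string;
    both keys are illustrative, not the toolkit's real column names.
    """
    results = []
    for row in rows:
        status = row["status"].strip().lower()
        passed = status == "green"  # assumed value for a healthy service
        results.append(
            SketchResult(
                service=row["service"],
                passed=passed,
                message=f"{row['service']} reported status '{status}'",
            )
        )
    return results


sample_rows = [
    {"service": "Redis", "status": "green"},
    {"service": "MongoDB", "status": "red"},
]
sample_results = analyze_rows(sample_rows)
for result in sample_results:
    print(result)
```

Returning a result for every service, rather than only for the failing ones, keeps the report complete even when everything is healthy.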

domino_admin_toolkit.checks.test_domino_service_health.service_health_data()

Collect service health data from nucleus-frontend.

domino_admin_toolkit.checks.test_domino_service_health.test_domino_service_health(domino_version_string, service_health_data)

Validates the health status of core Domino platform services via the nucleus-frontend health endpoint.

This check queries the nucleus-frontend health endpoint to verify the operational status of:

Redis: Short-term caching
Vault: Secrets management
MongoDB: Database connectivity and data layer
Kubernetes: Cluster communication and resource management
RabbitMQ: Message queuing and inter-service communication
Platform services rollout status: nucleus-dispatcher, nucleus-develop, nucleus-train, model-hosting, pham-model-serving-service, nucleus-frontend

Failure Conditions:

nucleus-frontend health endpoint unreachable or returning errors
Individual services reporting unhealthy status (red/critical) in health response
Critical service dependencies showing degraded performance (yellow/warning)
Platform service rollouts showing deployment issues
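
As a rough illustration of how the failure conditions above might be evaluated, the sketch below turns a mapping of service name to reported status into a list of failure reasons. The function name, the input shape, and the use of `None` to signal an unreachable endpoint are all assumptions of this sketch; only the status words (red/critical, yellow/warning) come from the list above.

```python
from typing import Dict, List, Optional

# Status words taken from the failure conditions; grouping them into two
# severity sets is an assumption of this sketch.
UNHEALTHY = {"red", "critical"}
DEGRADED = {"yellow", "warning"}


def find_failures(statuses: Optional[Dict[str, str]]) -> List[str]:
    """Return human-readable failure reasons; an empty list means healthy.

    `statuses` maps a service name to its reported status, or is None when
    the health endpoint could not be reached (hypothetical convention).
    """
    if statuses is None:
        return ["nucleus-frontend health endpoint unreachable"]
    failures = []
    # Sort by service name so the report order is deterministic.
    for service, status in sorted(statuses.items()):
        normalized = status.strip().lower()
        if normalized in UNHEALTHY:
            failures.append(f"{service}: unhealthy ({normalized})")
        elif normalized in DEGRADED:
            failures.append(f"{service}: degraded ({normalized})")
    return failures


print(find_failures(None))
print(find_failures({"Vault": "yellow", "Redis": "red", "MongoDB": "green"}))
```

Any status outside the two sets is treated as healthy here; a stricter variant could flag unknown values instead.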

Troubleshooting Steps:

Verify nucleus-frontend pod is running and accessible via service discovery
Check individual service logs for error messages and root causes:
 - Redis connection and performance issues
 - Vault authentication and secret access problems
 - MongoDB connectivity and query performance
 - Kubernetes API server communication issues
 - RabbitMQ message queue processing delays
Review platform service rollout status for deployment failures
Validate network policies and service mesh configuration
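
The troubleshooting steps above map onto a handful of read-only `kubectl` invocations. The sketch below only builds and prints the commands; it does not run them. The `domino-platform` namespace and the assumption that each service runs as a Kubernetes Deployment of the same name are illustrative and should be adjusted to the actual installation.

```python
import shlex
from typing import List

NAMESPACE = "domino-platform"  # assumption: adjust to your installation


def troubleshooting_commands(
    deployment: str, namespace: str = NAMESPACE
) -> List[List[str]]:
    """Build read-only kubectl commands for one deployment."""
    target = f"deployment/{deployment}"
    return [
        # Are the pods running?
        ["kubectl", "get", "pods", "-n", namespace],
        # Did the most recent rollout complete?
        ["kubectl", "rollout", "status", target, "-n", namespace, "--timeout=60s"],
        # Recent log lines for error messages and root causes.
        ["kubectl", "logs", target, "-n", namespace, "--tail=100"],
        # Network policies that could block the health endpoint.
        ["kubectl", "get", "networkpolicy", "-n", namespace],
    ]


for command in troubleshooting_commands("nucleus-frontend"):
    print(shlex.join(command))
```

Keeping each command as a list of arguments makes it safe to pass to `subprocess.run` later without shell quoting issues.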

Resolution Steps:

For nucleus-frontend connectivity issues:
 - Restart nucleus-frontend pods if unresponsive
 - Check service discovery and load balancer configuration
 - Verify network policies allow health endpoint access
For individual service failures:
 - Review specific service logs and restart unhealthy pods
 - Check resource constraints and scaling configuration
 - Validate service dependencies and integration endpoints
For platform rollout issues:
 - Check deployment status and pod readiness
 - Review resource availability and scheduling constraints
 - Validate service configurations and environment variables
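
One possible ordering of the resolution steps, sketched as `kubectl` commands that are built but not executed: inspect first, restart second, then wait for the rollout. The function name is hypothetical, and the sketch assumes the affected service runs as a Kubernetes Deployment; the namespace must be supplied by the caller.

```python
import shlex
from typing import List


def resolution_commands(deployment: str, namespace: str) -> List[List[str]]:
    """Build kubectl commands for inspecting and restarting a deployment."""
    target = f"deployment/{deployment}"
    return [
        # Inspect readiness, resource requests, and environment variables.
        ["kubectl", "describe", target, "-n", namespace],
        # Recent events often reveal scheduling or resource problems.
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
        # Restart the pods; unlike the others, this changes cluster state.
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        # Wait for the restarted rollout to finish.
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


# "example-namespace" is a placeholder, not a real Domino namespace.
for command in resolution_commands("nucleus-frontend", "example-namespace"):
    print(shlex.join(command))
```

Placing the restart after the inspection commands preserves the evidence (pod state and events) that a restart would otherwise overwrite.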