domino_admin_toolkit.checks.test_domino_service_health module
- pydantic model domino_admin_toolkit.checks.test_domino_service_health.ServiceHealthAnalyzer
Bases: AnalyzerBase
Analyzes the health status of Domino services as reported by the nucleus-frontend health endpoint.
- Evaluates:
Individual service health status from nucleus-frontend health endpoint
Service status mapping (green=pass, red/critical=fail, yellow/warning=warn)
Error handling for malformed or missing health data
- analyze(data)
Analyze health data from the nucleus-frontend health endpoint.
- Return type:
- Args:
- data: DataFrame containing parsed health data with columns:
name: Service name
status: Health status (typically “green”, “red”, “yellow”)
message: Optional status message
type: Type of check (e.g., “Health”)
- Returns:
List of CheckResult objects for each service
- name: ClassVar[str] = 'ServiceHealthAnalyzer'
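The status mapping described above can be illustrated with a minimal sketch. This is not the toolkit's implementation: `SketchResult` is a hypothetical stand-in for `CheckResult`, the function takes plain dict rows instead of a DataFrame so it runs without pandas, and treating unrecognized or missing statuses as failures is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping

# Mapping taken from the class description; anything outside it is
# treated as "fail" below, which is an assumption of this sketch.
STATUS_TO_OUTCOME = {
    "green": "pass",
    "yellow": "warn",
    "warning": "warn",
    "red": "fail",
    "critical": "fail",
}


@dataclass
class SketchResult:
    """Hypothetical stand-in for the toolkit's CheckResult."""
    service: str
    outcome: str
    message: str


def classify_rows(rows: Iterable[Mapping]) -> List[SketchResult]:
    """Map each health row (name, status, message, type) to an outcome."""
    results = []
    for row in rows:
        name = str(row.get("name") or "unknown")
        raw = row.get("status")
        # Normalize case and whitespace before looking up the status.
        status = str(raw).strip().lower() if raw is not None else ""
        outcome = STATUS_TO_OUTCOME.get(status)
        if outcome is None:
            # Malformed or missing status: surface it instead of hiding it.
            outcome = "fail"
            message = f"unrecognized or missing status: {raw!r}"
        else:
            message = str(row.get("message") or "")
        results.append(SketchResult(name, outcome, message))
    return results
```

A call such as `classify_rows([{"name": "Redis", "status": "green"}])` returns one `SketchResult` per service, mirroring the one-result-per-service shape that `analyze` documents.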
- domino_admin_toolkit.checks.test_domino_service_health.service_health_data()
Collect service health data from nucleus-frontend
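As a rough illustration of the parsing side of this collection step, the sketch below turns a raw health response into rows with the four columns that `analyze` expects. The payload shape (a JSON array of objects) and the helper name `parse_health_payload` are assumptions made for this example; the actual response format of the nucleus-frontend health endpoint is not documented here.

```python
import json
from typing import Dict, List, Optional

# Column names taken from the analyze() argument description.
COLUMNS = ("name", "status", "message", "type")


def parse_health_payload(text: Optional[str]) -> List[Dict[str, object]]:
    """Parse an ASSUMED JSON-array payload into row dicts.

    Malformed input yields an empty list rather than raising, so the
    caller can report "missing health data" as its own condition.
    """
    try:
        payload = json.loads(text)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers invalid JSON.
        return []
    if not isinstance(payload, list):
        return []
    rows = []
    for entry in payload:
        if not isinstance(entry, dict):
            continue  # skip entries that are not objects
        # Missing keys become None so every row has all four columns.
        rows.append({col: entry.get(col) for col in COLUMNS})
    return rows
```

The resulting list of dicts could be passed to `pandas.DataFrame(rows)` to obtain the DataFrame form described for `analyze`.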
- domino_admin_toolkit.checks.test_domino_service_health.test_domino_service_health(domino_version_string, service_health_data)
Validates the health status of core Domino platform services via nucleus-frontend health endpoint.
This check queries the nucleus-frontend health endpoint to verify operational status of:
Redis: Short-term caching
Vault: Secrets management
MongoDB: Database connectivity and data layer
Kubernetes: Cluster communication and resource management
RabbitMQ: Message queuing and inter-service communication
Platform services rollout status: nucleus-dispatcher, nucleus-develop, nucleus-train, model-hosting, pham-model-serving-service, nucleus-frontend
- Failure Conditions:
nucleus-frontend health endpoint unreachable or returning errors
Individual services reporting unhealthy status (red/critical) in health response
Critical service dependencies showing degraded performance (yellow/warning)
Platform service rollouts showing deployment issues
- Troubleshooting Steps:
Verify nucleus-frontend pod is running and accessible via service discovery
Check individual service logs for error messages and root causes:
  - Redis connection and performance issues
  - Vault authentication and secret access problems
  - MongoDB connectivity and query performance
  - Kubernetes API server communication issues
  - RabbitMQ message queue processing delays
Review platform service rollout status for deployment failures
Validate network policies and service mesh configuration
- Resolution Steps:
For nucleus-frontend connectivity issues:
  - Restart nucleus-frontend pods if unresponsive
  - Check service discovery and load balancer configuration
  - Verify network policies allow health endpoint access
For individual service failures:
  - Review specific service logs and restart unhealthy pods
  - Check resource constraints and scaling configuration
  - Validate service dependencies and integration endpoints
For platform rollout issues:
  - Check deployment status and pod readiness
  - Review resource availability and scheduling constraints
  - Validate service configurations and environment variables
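The troubleshooting and resolution steps above mostly come down to a few `kubectl` queries per service. The sketch below only builds the command strings and does not execute anything. It assumes that the platform services run as Kubernetes Deployments and that they live in a namespace called `domino-platform`; both are assumptions to adjust for a given installation, and `triage_commands` is a hypothetical helper name.

```python
import shlex
from typing import List


def triage_commands(
    deployment: str,
    namespace: str = "domino-platform",  # ASSUMED namespace; adjust as needed
    tail: int = 100,
) -> List[str]:
    """Return read-only kubectl commands for inspecting one service."""
    target = f"deployment/{deployment}"
    commands = [
        # Pod readiness across the namespace.
        ["kubectl", "get", "pods", "-n", namespace],
        # Rollout status for the service, with a bounded wait.
        ["kubectl", "rollout", "status", target, "-n", namespace, "--timeout=60s"],
        # Recent log lines for error messages and root causes.
        ["kubectl", "logs", target, "-n", namespace, f"--tail={tail}"],
    ]
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in triage_commands("nucleus-frontend"):
        print(line)
```

Keeping the helper read-only is deliberate: restarting pods (for example with `kubectl rollout restart`) changes cluster state and is better run by hand after the logs and rollout status have been reviewed.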