domino_admin_toolkit.checks.test_kube_events module
- pydantic model domino_admin_toolkit.checks.test_kube_events.RegistryAuthenticationAnalyzer
Bases: AnalyzerBase
Analyzes registry authentication events to detect configuration and connectivity issues.
- Evaluates:
OAuth token failures (401 errors)
Image pull failures (ImagePullBackOff, ErrImagePull)
CNI container failures related to registry issues
Registry unavailability
Authorization failures
Provides targeted remediation guidance based on failure patterns.
- Fields:
- analyze(data)
Analyze registry authentication event data.
- Return type:
- Args:
- data: Dictionary containing:
registry_endpoint: The registry URL
failure_type: Category of failure
event_count: Number of events
affected_pods: List of affected pod names
pod_count: Number of unique affected pods
first_seen: Earliest event timestamp
last_seen: Latest event timestamp
sample_message: Example error message
- Returns:
List of CheckResult objects
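As a rough illustration of the input shape described above, the sketch below builds one record with the documented keys and checks that none are missing. The sample values, the `REQUIRED_KEYS` constant, and the `missing_keys` helper are hypothetical and are not part of the toolkit.

```python
from datetime import datetime, timezone
from typing import Any

# Keys listed in the Args section above.
REQUIRED_KEYS = {
    "registry_endpoint", "failure_type", "event_count", "affected_pods",
    "pod_count", "first_seen", "last_seen", "sample_message",
}


def missing_keys(data: dict[str, Any]) -> set[str]:
    """Return the documented keys that are absent from `data` (hypothetical helper)."""
    return REQUIRED_KEYS - set(data)


# Hypothetical sample record; every value here is made up for illustration.
sample = {
    "registry_endpoint": "registry.example.com",
    "failure_type": "oauth_401",
    "event_count": 4,
    "affected_pods": ["pod-a", "pod-b"],
    "pod_count": 2,
    "first_seen": datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc),
    "last_seen": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),
    "sample_message": "unexpected status code 401 Unauthorized",
}
```

A record that passes this check could then plausibly be handed to `analyze(data)`; the exact value types the analyzer expects are not stated in this reference.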
- name: ClassVar[str] = 'RegistryAuthenticationAnalyzer'
- domino_admin_toolkit.checks.test_kube_events.log_events(events, category)
- domino_admin_toolkit.checks.test_kube_events.registry_authentication_events(get_bad_pod_events, get_warning_cluster_events)
Collect and analyze registry authentication failure events.
Combines pod and cluster events to identify registry authentication issues including image pull failures, OAuth errors, and CNI container failures.
- Returns:
DataFrame with categorized registry authentication issues
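The aggregation this fixture performs might look roughly like the sketch below, which groups raw events by registry endpoint and failure type and produces one row per group with the fields that `analyze(data)` documents. It uses plain dictionaries instead of a DataFrame to stay dependency-free, and the input keys (`pod`, `timestamp`, `message`) and the `summarize` function are assumptions, not the toolkit's actual implementation.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize(events):
    """Group events by (registry_endpoint, failure_type) into summary rows."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["registry_endpoint"], ev["failure_type"])].append(ev)

    rows = []
    for (endpoint, failure_type), evs in groups.items():
        pods = sorted({e["pod"] for e in evs})  # unique pod names
        times = [e["timestamp"] for e in evs]
        rows.append({
            "registry_endpoint": endpoint,
            "failure_type": failure_type,
            "event_count": len(evs),
            "affected_pods": pods,
            "pod_count": len(pods),
            "first_seen": min(times),
            "last_seen": max(times),
            "sample_message": evs[0]["message"],  # first message seen in the group
        })
    return rows


# Hypothetical events, for illustration only.
t0 = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
events = [
    {"registry_endpoint": "registry.example.com", "failure_type": "image_pull",
     "pod": "pod-a", "timestamp": t0, "message": "Error: ErrImagePull"},
    {"registry_endpoint": "registry.example.com", "failure_type": "image_pull",
     "pod": "pod-a", "timestamp": t1, "message": "Error: ImagePullBackOff"},
]
rows = summarize(events)
```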
- domino_admin_toolkit.checks.test_kube_events.test_kube_cluster_events(get_good_cluster_events, get_warning_cluster_events, event_type)
- Description:
Checks all Kubernetes events, highlighting events that are not classified as type ‘Normal’.
- Result:
Logs good and bad events based on their classification.
- domino_admin_toolkit.checks.test_kube_events.test_kube_critical_pod_events(get_bad_pod_events)
- Description:
Filters Kubernetes Pod events from the last two weeks and identifies critical events where Pods were OOMKilled.
- Result:
Logs events categorized as “Critical” for Pods that were OOMKilled. Asserts that there are no OOMKilled events, indicating no critical memory issues in the cluster.
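The two-week filter described above can be sketched as follows. This is a minimal illustration, assuming each event is a dictionary with `reason`, `message`, and `last_seen` keys and that an OOM kill is recognizable by the substring "OOMKilled"; the real fixture's field names and matching rule are not documented here.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(weeks=2)  # "last two weeks" from the description


def recent_oomkilled(events, now):
    """Return events inside the window whose reason or message mentions OOMKilled."""
    cutoff = now - WINDOW
    return [
        ev for ev in events
        if ev["last_seen"] >= cutoff
        and "OOMKilled" in f'{ev["reason"]} {ev["message"]}'
    ]


# Hypothetical events, for illustration only.
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
events = [
    {"reason": "OOMKilled", "message": "container killed", "last_seen": now - timedelta(days=3)},
    {"reason": "OOMKilled", "message": "container killed", "last_seen": now - timedelta(days=30)},
    {"reason": "Unhealthy", "message": "Readiness probe failed", "last_seen": now - timedelta(days=1)},
]
critical = recent_oomkilled(events, now)
```

In a check of this kind, the pass condition would be that the returned list is empty.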
- domino_admin_toolkit.checks.test_kube_events.test_kube_pod_events(get_bad_pod_events, get_good_pod_events, event_type)
- Description:
Checks Kubernetes Pod events for readiness and liveness probe failures. Filters events from the last two weeks.
- Result:
Logs events categorized as “Good” when they have no probe failures and as “Bad” when they have multiple probe failures. Asserts that no events have multiple probe failures, since repeated failures indicate potential issues with Pod health.
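One way to picture the Good/Bad split is the sketch below. It assumes each event carries a `message` and a repeat `count`, that probe failures are identified by the kubelet message prefixes "Readiness probe failed" and "Liveness probe failed", and that "multiple" means a count above 1; the toolkit's actual threshold and event fields may differ.

```python
PROBE_MARKERS = ("Readiness probe failed", "Liveness probe failed")


def split_probe_events(events, threshold=1):
    """Return (good, bad): events without probe failures, and repeated probe failures."""
    good, bad = [], []
    for ev in events:
        is_probe_failure = any(marker in ev["message"] for marker in PROBE_MARKERS)
        if not is_probe_failure:
            good.append(ev)
        elif ev["count"] > threshold:
            bad.append(ev)
        # A single probe failure lands in neither list (an assumption of this sketch).
    return good, bad


# Hypothetical events, for illustration only.
events = [
    {"message": "Started container web", "count": 1},
    {"message": "Readiness probe failed: HTTP probe failed with statuscode: 503", "count": 7},
    {"message": "Liveness probe failed: connection refused", "count": 1},
]
good, bad = split_probe_events(events)
```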
- domino_admin_toolkit.checks.test_kube_events.test_registry_authentication_events(registry_authentication_events)
Detects registry authentication failures from Kubernetes events and provides targeted remediation.
This check identifies and analyzes container registry authentication and connectivity issues that can cause pod failures, CNI problems, and platform instability. Following the pattern from OT-3523 (TIAA registry authentication failure), this check detects:
Direct image pull failures (ImagePullBackOff, ErrImagePull)
OAuth token authentication failures (401 errors)
Subtle istio-validation container failures indicating CNI issues
Registry unavailability and connectivity problems
Authorization failures and credential expiration
The check correlates events across pods and namespaces to identify registry-wide patterns and provides specific remediation guidance based on the failure type.
- Failure Conditions:
OAuth 401 errors from registry endpoints (likely credential expiration)
ImagePullBackOff or ErrImagePull events (authentication or connectivity)
Back-off restarting failed container istio-validation (CNI registry issues)
Registry unavailable errors (infrastructure or network issues)
Authorization failures (credential misconfiguration)
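The failure conditions above could be matched against event messages along the lines of the following sketch. The category names, the regular expressions, and the `classify` function are illustrative assumptions; the check's real patterns are not given in this reference. Patterns are tried in order, so a message such as "401 Unauthorized" is reported as an OAuth 401 failure rather than a generic authorization failure.

```python
import re

# Ordered (category, pattern) pairs; the first match wins.
PATTERNS = [
    ("cni_failure", re.compile(r"back-off restarting failed container.*istio-validation", re.I)),
    ("oauth_401", re.compile(r"\b401\b")),
    ("authorization", re.compile(r"unauthorized|denied|forbidden", re.I)),
    ("registry_unavailable", re.compile(r"unavailable|connection refused|no such host", re.I)),
    ("image_pull", re.compile(r"ImagePullBackOff|ErrImagePull")),
]


def classify(message):
    """Return the first matching failure category, or None if nothing matches."""
    for category, pattern in PATTERNS:
        if pattern.search(message):
            return category
    return None
```

For example, `classify("Error: ImagePullBackOff")` falls through to the `image_pull` category, while an unrelated message such as `"Started container"` yields `None`.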
- Troubleshooting Steps:
Identify affected registry endpoints from the results table
Check registry credential secrets:
    kubectl get secret -n domino-platform | grep quay
    kubectl get secret -n domino-compute | grep quay
Verify credential format and expiration:
    kubectl get secret <secret-name> -n domino-platform -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
Test registry connectivity from cluster:
    kubectl run test-curl --rm -it --image=curlimages/curl -- curl -I https://<registry-endpoint>
For CNI issues, check istio-cni-node daemonset:
    kubectl get pods -n istio-system -l k8s-app=istio-cni-node
    kubectl logs -n istio-system <istio-cni-pod> --tail=100
Review affected pod events for detailed error messages:
    kubectl describe pod <pod-name> -n <namespace>
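When verifying the credential format, it can help to list which registries a secret covers without printing the credentials themselves. The sketch below decodes the base64 value of a `.dockerconfigjson` key and returns the registry hosts under its standard `auths` object. The `registries_in_dockerconfig` helper and the sample secret are hypothetical.

```python
import base64
import json


def registries_in_dockerconfig(encoded: str) -> list[str]:
    """Decode a base64 .dockerconfigjson value and return its registry hosts."""
    config = json.loads(base64.b64decode(encoded))
    return sorted(config.get("auths", {}))


# Hypothetical secret contents, for illustration only.
raw = json.dumps({
    "auths": {
        "quay.io": {"username": "user", "password": "placeholder"},
        "registry.example.com": {"username": "user", "password": "placeholder"},
    }
})
encoded = base64.b64encode(raw.encode()).decode()
hosts = registries_in_dockerconfig(encoded)
```

If the registry endpoint reported by the check is absent from the returned list, the secret likely does not cover that registry.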
- Resolution Steps:
For OAuth 401 errors (credential expiration):
    - Rotate registry credentials in domino-quay-repos secret
    - Update .dockerconfigjson with fresh tokens
    - Restart affected deployments to pick up new credentials
    - Verify: kubectl delete pod <affected-pod> -n <namespace>
For ImagePullBackOff (authentication):
    - Verify imagePullSecrets are correctly referenced in pod specs
    - Ensure secrets exist in the correct namespaces (platform, compute, field)
    - Check registry URL matches exactly in both secret and image references
    - Test manual image pull: docker pull <full-image-path>
For CNI container failures:
    - Ensure istio-system namespace has required registry credentials
    - Verify CNI daemonset can pull its init container images
    - Check istio-cni-node logs for specific registry errors
    - Restart istio-cni-node pods if credentials were updated
For registry unavailability:
    - Verify DNS resolution for registry endpoint
    - Check network policies allow egress to registry
    - Confirm firewall rules permit HTTPS (443) to registry
    - Contact registry infrastructure team for registry-side issues
- Required Permissions:
Read access to Kubernetes events cluster-wide
Read access to secrets in platform/compute namespaces (for troubleshooting)
Ability to describe pods and check logs (for troubleshooting)