domino_admin_toolkit.checks.test_kube_events module

pydantic model domino_admin_toolkit.checks.test_kube_events.RegistryAuthenticationAnalyzer

Bases: AnalyzerBase

Analyzes registry authentication events to detect configuration and connectivity issues.

Evaluates:
  • OAuth token failures (401 errors)

  • Image pull failures (ImagePullBackOff, ErrImagePull)

  • CNI container failures related to registry issues

  • Registry unavailability

  • Authorization failures

Provides targeted remediation guidance based on failure patterns.

Fields:

analyze(data)

Analyze registry authentication event data.

Return type:

list[CheckResult]

Args:
data: Dictionary containing:
  • registry_endpoint: The registry URL

  • failure_type: Category of failure

  • event_count: Number of events

  • affected_pods: List of affected pod names

  • pod_count: Number of unique affected pods

  • first_seen: Earliest event timestamp

  • last_seen: Latest event timestamp

  • sample_message: Example error message

Returns:

List of CheckResult objects

name: ClassVar[str] = 'RegistryAuthenticationAnalyzer'
domino_admin_toolkit.checks.test_kube_events.log_events(events, category)
domino_admin_toolkit.checks.test_kube_events.registry_authentication_events(get_bad_pod_events, get_warning_cluster_events)

Collect and analyze registry authentication failure events.

Combines pod and cluster events to identify registry authentication issues including image pull failures, OAuth errors, and CNI container failures.

Returns:

DataFrame with categorized registry authentication issues

domino_admin_toolkit.checks.test_kube_events.test_kube_cluster_events(get_good_cluster_events, get_warning_cluster_events, event_type)
Description:

Checks all Kubernetes events, highlighting events that are not classified as type ‘Normal’.

Result:

Logs good and bad events based on their classification.

domino_admin_toolkit.checks.test_kube_events.test_kube_critical_pod_events(get_bad_pod_events)
Description:

Filters Kubernetes Pod events from the last two weeks and identifies critical events where Pods were OOMKilled.

Result:

Logs events categorized as “Critical” for Pods that were OOMKilled. Asserts that there are no OOMKilled events, indicating no critical memory issues in the cluster.

domino_admin_toolkit.checks.test_kube_events.test_kube_pod_events(get_bad_pod_events, get_good_pod_events, event_type)
Description:

Checks Kubernetes Pod events for readiness and liveness probe failures. Filters events from the last two weeks.

Result:

Logs categorized as “Good” for events without probe failures and “Bad” for events with multiple probe failures. Asserts that there are no events with multiple probe failures, indicating potential issues with Pod health.

domino_admin_toolkit.checks.test_kube_events.test_registry_authentication_events(registry_authentication_events)

Detects registry authentication failures from Kubernetes events and provides targeted remediation.

This check identifies and analyzes container registry authentication and connectivity issues that can cause pod failures, CNI problems, and platform instability. Following the pattern from OT-3523 (TIAA registry authentication failure), this check detects:

  • Direct image pull failures (ImagePullBackOff, ErrImagePull)

  • OAuth token authentication failures (401 errors)

  • Subtle istio-validation container failures indicating CNI issues

  • Registry unavailability and connectivity problems

  • Authorization failures and credential expiration

The check correlates events across pods and namespaces to identify registry-wide patterns and provides specific remediation guidance based on the failure type.

Failure Conditions:
  • OAuth 401 errors from registry endpoints (likely credential expiration)

  • ImagePullBackOff or ErrImagePull events (authentication or connectivity)

  • Back-off restarting failed container istio-validation (CNI registry issues)

  • Registry unavailable errors (infrastructure or network issues)

  • Authorization failures (credential misconfiguration)

Troubleshooting Steps:
  1. Identify affected registry endpoints from the results table

  2. Check registry credential secrets: kubectl get secret -n domino-platform | grep quay kubectl get secret -n domino-compute | grep quay

  3. Verify credential format and expiration: kubectl get secret <secret-name> -n domino-platform -o jsonpath=’{.data..dockerconfigjson}’ | base64 -d

  4. Test registry connectivity from cluster: kubectl run test-curl –rm -it –image=curlimages/curl – curl -I https://<registry-endpoint>

  5. For CNI issues, check istio-cni-node daemonset: kubectl get pods -n istio-system -l k8s-app=istio-cni-node kubectl logs -n istio-system <istio-cni-pod> –tail=100

  6. Review affected pod events for detailed error messages: kubectl describe pod <pod-name> -n <namespace>

Resolution Steps:
  1. For OAuth 401 errors (credential expiration): - Rotate registry credentials in domino-quay-repos secret - Update .dockerconfigjson with fresh tokens - Restart affected deployments to pick up new credentials - Verify: kubectl delete pod <affected-pod> -n <namespace>

  2. For ImagePullBackOff (authentication): - Verify imagePullSecrets are correctly referenced in pod specs - Ensure secrets exist in the correct namespaces (platform, compute, field) - Check registry URL matches exactly in both secret and image references - Test manual image pull: docker pull <full-image-path>

  3. For CNI container failures: - Ensure istio-system namespace has required registry credentials - Verify CNI daemonset can pull its init container images - Check istio-cni-node logs for specific registry errors - Restart istio-cni-node pods if credentials were updated

  4. For registry unavailability: - Verify DNS resolution for registry endpoint - Check network policies allow egress to registry - Confirm firewall rules permit HTTPS (443) to registry - Contact registry infrastructure team for registry-side issues

Required Permissions:
  • Read access to Kubernetes events cluster-wide

  • Read access to secrets in platform/compute namespaces (for troubleshooting)

  • Ability to describe pods and check logs (for troubleshooting)