domino_admin_toolkit.checks.test_kubernetes module
Description: Checks the health of several Kubernetes components.
Result: See the individual checks for their descriptions and results.
- domino_admin_toolkit.checks.test_kubernetes.test_component_status(component)
- Description:
Validates that the scheduler, controller manager, and etcd are healthy
- Result:
If any component is marked as failed then the test will fail
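A minimal sketch of the pass/fail rule described above, not the toolkit's actual implementation. It assumes the component statuses are available as dictionaries shaped like the items returned by `kubectl get componentstatuses -o json`; the helper name `unhealthy_components` and the sample data are hypothetical.

```python
from typing import Dict, List


def unhealthy_components(items: List[Dict]) -> List[str]:
    """Return names of components without a 'Healthy' condition set to 'True'."""
    failed = []
    for item in items:
        conditions = item.get("conditions") or []
        healthy = any(
            c.get("type") == "Healthy" and c.get("status") == "True"
            for c in conditions
        )
        if not healthy:
            failed.append(item["metadata"]["name"])
    return failed


# Hypothetical sample data for illustration only.
sample = [
    {"metadata": {"name": "scheduler"},
     "conditions": [{"type": "Healthy", "status": "True"}]},
    {"metadata": {"name": "controller-manager"},
     "conditions": [{"type": "Healthy", "status": "False"}]},
    {"metadata": {"name": "etcd-0"},
     "conditions": [{"type": "Healthy", "status": "True"}]},
]

# The check would fail if this list is non-empty.
print(unhealthy_components(sample))
```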
- domino_admin_toolkit.checks.test_kubernetes.test_container_restarts(ns, request)
- Description:
Checks for container restarts within the last day and displays the following information for each container that restarted:
  * RESTARTS: The number of times the container has been restarted.
  * DAYS: The number of days since the pod was started.
  * INTERVAL: The number of days between each restart of the container.
  * LAST_RESTART: The number of days since the last restart of the container.
  * LAST_FINISHED: The date and time when the last instance of the container finished running.
  * EXIT_CODE: The exit code returned by the last instance of the container.
  * REASON: The reason the last instance of the container terminated.
- Result:
If any restarts have occurred, the check flags a failure and returns a list of container restarts with their exit codes
- Public Facing KB:
https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
https://kubernetes.io/docs/concepts/workloads/pods/#containers
https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/
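The time-based columns in this check can be illustrated with a short sketch. It is an assumption, not the toolkit's code: in particular, INTERVAL is computed here as the pod age divided by the restart count, which is only one plausible reading of "days between each restart". The function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def restart_summary(restarts, pod_started, last_finished, now):
    """Build one report row for a container (assumed column semantics)."""
    day = timedelta(days=1)
    days = (now - pod_started) / day            # DAYS: pod age in days
    last_restart = (now - last_finished) / day  # LAST_RESTART: days since last restart
    # Assumption: INTERVAL is the average gap, pod age divided by restart count.
    interval = days / restarts if restarts else None
    return {
        "RESTARTS": restarts,
        "DAYS": round(days, 2),
        "INTERVAL": round(interval, 2) if interval is not None else None,
        "LAST_RESTART": round(last_restart, 2),
        "LAST_FINISHED": last_finished.isoformat(),
    }


def restarted_within_last_day(last_finished, now):
    """True when the last container instance finished within the past 24 hours."""
    return now - last_finished <= timedelta(days=1)


# Hypothetical timestamps for illustration only.
utc = timezone.utc
row = restart_summary(
    restarts=4,
    pod_started=datetime(2024, 1, 1, tzinfo=utc),
    last_finished=datetime(2024, 1, 10, 12, tzinfo=utc),
    now=datetime(2024, 1, 11, tzinfo=utc),
)
print(row)
```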
- domino_admin_toolkit.checks.test_kubernetes.test_get_version()
- Description:
Checks that the cluster reports a valid Kubernetes version
- Result:
Shows the current Kubernetes version; fails if none is returned
- domino_admin_toolkit.checks.test_kubernetes.test_k8s_api_server_request_rate(prometheus_client)
- Description:
Retrieves Kubernetes API server request rate and latency information from Prometheus
- Result:
Checks the request rate, filtered to 4xx and 5xx response codes, against a threshold of 0.8 for unsuccessful requests
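The following sketch shows one possible reading of this rule: flag any 4xx or 5xx series whose request rate exceeds 0.8. It is not the toolkit's implementation. The input is assumed to be the `result` list of a Prometheus instant-query vector response, and the PromQL string, the threshold interpretation, and the name `failing_series` are assumptions for illustration.

```python
# Assumed query; the toolkit's actual PromQL is not shown in this document.
ASSUMED_QUERY = 'sum by (code) (rate(apiserver_request_total{code=~"[45].."}[5m]))'


def failing_series(result, threshold=0.8):
    """Return (code, rate) pairs for 4xx/5xx series whose rate exceeds threshold."""
    flagged = []
    for sample in result:
        code = sample["metric"].get("code", "")
        rate = float(sample["value"][1])  # Prometheus returns values as strings
        if code[:1] in ("4", "5") and rate > threshold:
            flagged.append((code, rate))
    return flagged


# Hypothetical samples shaped like a Prometheus vector result.
samples = [
    {"metric": {"code": "200"}, "value": [1700000000, "35.0"]},
    {"metric": {"code": "404"}, "value": [1700000000, "0.2"]},
    {"metric": {"code": "500"}, "value": [1700000000, "1.5"]},
]
print(failing_series(samples))
```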
- domino_admin_toolkit.checks.test_kubernetes.test_k8s_api_server_responding()
- Description:
Tries to query the Kubernetes API server version to ensure the API server is responding to requests.
- Results:
Fails if there is a problem connecting to or querying the API server
- domino_admin_toolkit.checks.test_kubernetes.test_kernel_amazon_version()
- Description:
Checks for the Amazon Linux kernel memory leak bug in the EKS-optimized AMI based on Linux kernel version 5.10.x
- Result:
If the kernel version falls within the range of “kernel-5.10.176-157.645.amzn2” to “kernel-5.10.177-158.645.amzn2”, the test fails with a message indicating that the kernel might have a known bug.
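The range comparison can be sketched as a numeric tuple comparison. This is an illustrative assumption rather than the toolkit's code: the range is treated as inclusive at both ends, and the helpers `kernel_key` and `is_affected` are hypothetical names.

```python
import re

BAD_LOW = "kernel-5.10.176-157.645.amzn2"
BAD_HIGH = "kernel-5.10.177-158.645.amzn2"


def kernel_key(version):
    """Turn a kernel version string into a tuple of integers for comparison."""
    if version.startswith("kernel-"):
        version = version[len("kernel-"):]
    # Keep only the numeric part before the distribution tag (".amzn2...").
    core = version.split(".amzn")[0]
    return tuple(int(n) for n in re.findall(r"\d+", core))


def is_affected(version):
    """True when the version lies in the (assumed inclusive) affected range."""
    return kernel_key(BAD_LOW) <= kernel_key(version) <= kernel_key(BAD_HIGH)


# Version strings in the form a node typically reports its kernel version.
for v in ("5.10.176-157.645.amzn2.x86_64", "5.10.178-162.673.amzn2.x86_64"):
    print(v, is_affected(v))
```

Comparing integer tuples avoids the pitfalls of plain string comparison, where "5.10.99" would sort after "5.10.176".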
- domino_admin_toolkit.checks.test_kubernetes.test_kernel_redhat_version()
- Description:
Checks the kernel version
- Result:
Fails if the kernel version is 3.10.0-1062.4.1.el7, which has a known bug: https://bugzilla.redhat.com/show_bug.cgi?id=1507149
- Public Facing KB:
- domino_admin_toolkit.checks.test_kubernetes.test_kube_dns_health()
- Description:
Retrieves the kube-dns pods that are in the Running state
- Results:
Fails if no kube-dns pods are in the Running phase
- domino_admin_toolkit.checks.test_kubernetes.test_node_health()
- Description:
Checks node health. A node is considered “failed” if it meets any of these conditions:
  - The node's Ready condition has a status other than “True”. A node can be stuck pending while scaling, but not for more than 30 minutes
  - The number of evicted pods is more than 10% of the total pods on a single node
- Result:
Prints a table of failed nodes with:
  - Node name
  - Internal node IP
  - Node condition type
  - Node condition status
  - Node label
  - All the node condition statuses
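The two failure conditions for a node can be sketched as a single predicate. This is a hedged illustration, not the toolkit's logic: it assumes the 30-minute allowance is measured from the time the node's Ready condition last changed, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def node_failed(ready_status, not_ready_since, evicted_pods, total_pods, now,
                grace=timedelta(minutes=30), eviction_ratio=0.10):
    """Apply the two documented rules to one node's summary values."""
    # Rule 1: Ready condition is not "True" for longer than the grace period.
    not_ready_too_long = (
        ready_status != "True" and now - not_ready_since > grace
    )
    # Rule 2: more than 10% of the node's pods have been evicted.
    too_many_evictions = (
        total_pods > 0 and evicted_pods / total_pods > eviction_ratio
    )
    return not_ready_too_long or too_many_evictions


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
since = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)
print(node_failed("Unknown", since, evicted_pods=0, total_pods=20, now=now))
print(node_failed("True", since, evicted_pods=1, total_pods=20, now=now))
```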
- domino_admin_toolkit.checks.test_kubernetes.test_os_version()
- Description:
Checks the operating system version of the nodes
- Result:
Shows the current OS version; fails if nothing is returned
- domino_admin_toolkit.checks.test_kubernetes.test_pod_health()
- Description:
Checks whether any pods are in the “Pending” or “Failed” status or have a terminated container. A pod is considered failed if:
  - Status is “Pending” and the pod has been pending for over 30 minutes
  - Status is “Failed”
  - A container has been “Terminated”
- Result:
Prints a table of failed pods with:
  - Pod name
  - Node name the pod is attached to
  - Pod IP
  - Pod namespace
  - Status reason for failure
  - Last pod event, if the failure comes from a pending pod
  - The container termination reason, if the failure is container related
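The three pod failure rules can be sketched as a classifier over a pod object. This is an assumption-laden illustration, not the toolkit's code: the pod is taken to be a dictionary shaped like Kubernetes API JSON, pending time is approximated from `metadata.creationTimestamp`, and `failure_reason` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def failure_reason(pod, now, pending_grace=timedelta(minutes=30)):
    """Return a short failure reason for the pod, or None if it looks healthy."""
    status = pod.get("status", {})
    phase = status.get("phase")
    if phase == "Failed":
        return status.get("reason") or "Failed"
    if phase == "Pending":
        # Assumption: pending duration is measured from pod creation time.
        created = datetime.fromisoformat(
            pod["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        )
        if now - created > pending_grace:
            return "Pending for over 30 minutes"
    for container in status.get("containerStatuses") or []:
        terminated = (container.get("state") or {}).get("terminated")
        if terminated:
            return "Terminated: " + terminated.get("reason", "unknown")
    return None


# Hypothetical pod for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pod = {
    "metadata": {"creationTimestamp": "2024-01-01T09:00:00Z"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
        ],
    },
}
print(failure_reason(pod, now))
```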
- domino_admin_toolkit.checks.test_kubernetes.verify_kernel_version(os_names, version_is_bad)
- Return type: