domino_admin_toolkit.checks.test_kubernetes module

Description: Checks the health of a number of Kubernetes components.

Result: See the individual checks for their descriptions and results.

domino_admin_toolkit.checks.test_kubernetes.test_component_status(component)
Description:

Validates that the scheduler, controller manager, and etcd are healthy

Result:

If any component is marked as failed then the test will fail
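As a rough illustration of the kind of check described above, the sketch below filters a ComponentStatus listing for components whose `Healthy` condition is not `"True"`. It is not the toolkit's implementation: the function name `unhealthy_components` and the sample data are hypothetical, and the input is assumed to have the shape of `kubectl get componentstatuses -o json` output.

```python
def unhealthy_components(component_statuses):
    """Return names of components without a Healthy condition of "True".

    Assumption: `component_statuses` is a dict shaped like the JSON
    printed by `kubectl get componentstatuses -o json`.
    """
    failed = []
    for item in component_statuses.get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("conditions", [])
        healthy = any(
            c.get("type") == "Healthy" and c.get("status") == "True"
            for c in conditions
        )
        if not healthy:
            failed.append(name)
    return failed


# Illustrative sample data, not taken from a real cluster.
sample = {
    "items": [
        {"metadata": {"name": "scheduler"},
         "conditions": [{"type": "Healthy", "status": "True"}]},
        {"metadata": {"name": "controller-manager"},
         "conditions": [{"type": "Healthy", "status": "True"}]},
        {"metadata": {"name": "etcd-0"},
         "conditions": [{"type": "Healthy", "status": "False"}]},
    ]
}

failed = unhealthy_components(sample)
print(failed)
```

A check built on this idea would fail whenever the returned list is non-empty.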

domino_admin_toolkit.checks.test_kubernetes.test_container_restarts(ns, request)
Description:

Checks for any container restarts within the last day. It displays the following information for the containers that restarted:
* RESTARTS: The number of times a container has been restarted.
* DAYS: The number of days since the pod was started.
* INTERVAL: The number of days between each restart of the container.
* LAST_RESTART: The number of days since the last restart of the container.
* LAST_FINISHED: The date and time when the last instance of the container finished running.
* EXIT_CODE: The exit code returned by the last instance of the container.
* REASON: The cause of the last instance of the container’s termination.

Result:

If any restarts have occurred, the check flags a failure and returns a list of container restarts and their exit codes

Public Facing KB:

https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
https://kubernetes.io/docs/concepts/workloads/pods/#containers
https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/
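To make the DAYS, INTERVAL and LAST_RESTART columns concrete, here is a minimal sketch of how they could be derived from timestamps. The helper `restart_summary` and its parameters are hypothetical, and it assumes INTERVAL is simply the pod age divided by the restart count; the toolkit may compute it differently.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400.0


def restart_summary(pod_started, last_finished, restarts, now):
    """Derive DAYS, INTERVAL and LAST_RESTART for one container.

    All arguments are hypothetical inputs: timezone-aware datetimes
    plus the container's restart count.
    """
    days = (now - pod_started).total_seconds() / SECONDS_PER_DAY
    last_restart = (now - last_finished).total_seconds() / SECONDS_PER_DAY
    # Assumption: INTERVAL is the pod age averaged over the restarts.
    interval = days / restarts if restarts else None
    return {
        "RESTARTS": restarts,
        "DAYS": round(days, 2),
        "INTERVAL": None if interval is None else round(interval, 2),
        "LAST_RESTART": round(last_restart, 2),
    }


# Illustrative timestamps only.
now = datetime(2024, 1, 11, tzinfo=timezone.utc)
summary = restart_summary(
    pod_started=datetime(2024, 1, 1, tzinfo=timezone.utc),
    last_finished=datetime(2024, 1, 10, 12, tzinfo=timezone.utc),
    restarts=4,
    now=now,
)
# A restart counts as "within the last day" when LAST_RESTART <= 1.
restarted_recently = summary["LAST_RESTART"] <= 1.0
print(summary, restarted_recently)
```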

domino_admin_toolkit.checks.test_kubernetes.test_get_version()
Description:

Checks that a valid Kubernetes version is returned

Result:

Shows the current Kubernetes version; fails if none is returned

domino_admin_toolkit.checks.test_kubernetes.test_k8s_api_server_request_rate(prometheus_client)
Description:

Retrieves Kubernetes API server request rate and latency information from Prometheus

Result:

Checks the request rate, filtered for 400 and 500 error codes, against a threshold of > 0.8 for unsuccessful requests

domino_admin_toolkit.checks.test_kubernetes.test_k8s_api_server_responding()
Description:

Tries to query the Kubernetes API server version to ensure the API server is responding to requests.

Result:

Fails if there is a problem connecting to or querying the API server

domino_admin_toolkit.checks.test_kubernetes.test_kernel_amazon_version()
Description:

Checks for the Amazon Linux kernel memory leak bug in the EKS-optimized AMI based on Linux kernel version 5.10.x

Result:

If kernel version falls within the range of “kernel-5.10.176-157.645.amzn2” and “kernel-5.10.177-158.645.amzn2”, the test will fail with a message indicating that the kernel might have a known bug.
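The range test described above can be sketched by converting kernel release strings into integer tuples and comparing them. This is only an illustration: `kernel_version_tuple` and `is_affected` are hypothetical names (not the module's `versiontuple` or `verify_kernel_version`), the range is assumed to be inclusive at both ends, and the release strings passed in at the bottom are example inputs.

```python
import re


def kernel_version_tuple(release):
    """Convert e.g. 'kernel-5.10.176-157.645.amzn2' to (5, 10, 176, 157, 645)."""
    if release.startswith("kernel-"):
        release = release[len("kernel-"):]
    parts = []
    for field in re.split(r"[.-]", release):
        # Stop at the first non-numeric field, such as 'amzn2'.
        if not field.isdigit():
            break
        parts.append(int(field))
    return tuple(parts)


# Bounds quoted in the check's description.
LOW = kernel_version_tuple("kernel-5.10.176-157.645.amzn2")
HIGH = kernel_version_tuple("kernel-5.10.177-158.645.amzn2")


def is_affected(release):
    """Assumption: the affected range includes both bounds."""
    return LOW <= kernel_version_tuple(release) <= HIGH


print(is_affected("5.10.176-157.645.amzn2.x86_64"))
print(is_affected("5.10.179-166.674.amzn2.x86_64"))
```

Comparing tuples of integers avoids the pitfalls of string comparison, where "5.10.9" would sort after "5.10.176".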

domino_admin_toolkit.checks.test_kubernetes.test_kernel_redhat_version()
Description:

Checks the Red Hat kernel version for a known bug

Result:

If the kernel version is 3.10.0-1062.4.1.el7, the test fails because of a known bug: https://bugzilla.redhat.com/show_bug.cgi?id=1507149

Public Facing KB:

https://bugzilla.redhat.com/show_bug.cgi?id=1507149

domino_admin_toolkit.checks.test_kubernetes.test_kube_dns_health()
Description:

Retrieves the kube-dns pods that are in the running state

Result:

Fails if there are no kube-dns pods in running phase

domino_admin_toolkit.checks.test_kubernetes.test_node_health()
Description:

Checks node health. A node is considered “failed” if it meets any of these conditions:
- The node's Ready condition has a status other than “True”. A node can be stuck ‘pending’ while scaling, but not for more than 30 minutes
- The total number of evicted pods is more than 10% of the total pods for a single node

Result:

Table prints failed nodes:
- Node name
- Internal node IP
- Node condition type
- Node condition status
- Node Label
- All the node condition statuses
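The two failure conditions for a node can be expressed as a small predicate. The sketch below is an assumption-laden illustration, not the toolkit's code: `node_failed` and its parameters are hypothetical, and the 30-minute allowance is interpreted as a grace period measured from the moment the node stopped being ready.

```python
from datetime import datetime, timedelta, timezone

NOT_READY_GRACE = timedelta(minutes=30)  # allowance while scaling
EVICTED_RATIO = 0.10                     # more than 10% of pods evicted


def node_failed(ready_status, not_ready_since, evicted_pods, total_pods, now):
    """Return True if the node meets either failure condition.

    Hypothetical inputs: the Ready condition status string, the time the
    node became not ready (unused when ready), and pod counts for the node.
    """
    not_ready_too_long = (
        ready_status != "True" and now - not_ready_since > NOT_READY_GRACE
    )
    too_many_evictions = (
        total_pods > 0 and evicted_pods / total_pods > EVICTED_RATIO
    )
    return not_ready_too_long or too_many_evictions


# Illustrative values only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(node_failed("False", now - timedelta(minutes=45), 0, 20, now))
print(node_failed("True", None, 2, 20, now))
```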

domino_admin_toolkit.checks.test_kubernetes.test_os_version()
Description:

Checks for node operating system version

Result:

Shows current OS version, fails if nothing is returned

domino_admin_toolkit.checks.test_kubernetes.test_pod_health()
Description:

Checks whether any pods are in the “Pending” or “Failed” status or have a terminated container. A pod is considered failed if:
- Status is “Pending” and the pod has been pending for over 30 minutes
- Status is “Failed”
- A container has been “Terminated”

Result:

Table prints failed pods:
- Pod name
- Name of the node the pod is attached to
- Pod IP
- Pod namespace
- Status reason for failure
- Last pod event, if the failure comes from a pending pod
- The container termination reason, if the failure is container related
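The pod failure rules can be sketched as a classifier that returns a reason string, or `None` for a healthy pod. The function `pod_failure_reason`, its parameters, and the wording of the returned reasons are hypothetical; the pending duration is assumed to be measured from the pod's creation time.

```python
from datetime import datetime, timedelta, timezone

PENDING_LIMIT = timedelta(minutes=30)


def pod_failure_reason(phase, created_at, terminated_reasons, now):
    """Classify a pod against the three failure rules.

    Hypothetical inputs: the pod phase string, its creation time, and a
    list of termination reasons for any terminated containers.
    """
    if phase == "Failed":
        return "Failed"
    if phase == "Pending" and now - created_at > PENDING_LIMIT:
        return "Pending for over 30 minutes"
    if terminated_reasons:
        return "Container terminated: " + ", ".join(terminated_reasons)
    return None


# Illustrative values only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
old = now - timedelta(hours=2)
recent = now - timedelta(minutes=5)

print(pod_failure_reason("Pending", old, [], now))
print(pod_failure_reason("Pending", recent, [], now))
print(pod_failure_reason("Running", old, ["OOMKilled"], now))
```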

domino_admin_toolkit.checks.test_kubernetes.verify_kernel_version(os_names, version_is_bad)
Return type:

None

domino_admin_toolkit.checks.test_kubernetes.versiontuple(v)
Return type:

tuple