test_kubernetes module
Description: Checks the health of a number of Kubernetes components.
Result: See the individual checks for their descriptions and results.
- test_kubernetes.test_component_status(component, status)
- Description:
Validates that the scheduler, controller manager, and etcd are healthy
- Result:
If any component is marked as failed then the test will fail
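The check above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes the input is a list of dictionaries shaped like Kubernetes ComponentStatus objects (for example, the `items` of `kubectl get componentstatuses -o json`), and the helper name `unhealthy_components` and the sample data are hypothetical.

```python
from typing import Dict, List


def unhealthy_components(component_statuses: List[Dict]) -> List[str]:
    """Return the names of components whose "Healthy" condition is not "True".

    Hypothetical helper: assumes ComponentStatus-shaped dictionaries.
    """
    failed = []
    for item in component_statuses:
        conditions = item.get("conditions", [])
        healthy = any(
            c.get("type") == "Healthy" and c.get("status") == "True"
            for c in conditions
        )
        if not healthy:
            failed.append(item["metadata"]["name"])
    return failed


# Hypothetical sample data, not taken from a real cluster.
sample_components = [
    {"metadata": {"name": "scheduler"},
     "conditions": [{"type": "Healthy", "status": "True", "message": "ok"}]},
    {"metadata": {"name": "controller-manager"},
     "conditions": [{"type": "Healthy", "status": "False",
                     "message": "connection refused"}]},
    {"metadata": {"name": "etcd-0"},
     "conditions": [{"type": "Healthy", "status": "True"}]},
]

# The test would fail if this list is non-empty.
print(unhealthy_components(sample_components))
```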
- test_kubernetes.test_container_restarts(ns)
- Description:
Checks for any container restarts within the last day. For each container that restarted, it displays the following information:
RESTARTS: The number of times a container has been restarted.
DAYS: The number of days since the pod was started.
INTERVAL: The number of days between each restart of the container.
LAST_RESTART: The number of days since the last restart of the container.
LAST_FINISHED: The date and time when the last instance of the container finished running.
EXIT_CODE: The exit code returned by the last instance of the container.
REASON: The cause of the last instance of the container’s termination.
- Result:
If any restarts have occurred, the test flags a failure and returns a list of the container restarts with their exit codes
- Public Facing KB:
https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
https://kubernetes.io/docs/concepts/workloads/pods/#containers
https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/
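As a rough sketch of how the columns listed for this check could be derived, the snippet below works on a dictionary shaped like a Pod object (`status.startTime`, `status.containerStatuses[].restartCount`, `lastState.terminated`). Several points are assumptions rather than the module's documented behaviour: INTERVAL is taken as the pod age divided by the restart count, LAST_RESTART is approximated by the time since the previous instance finished, and the helper names and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

SECONDS_PER_DAY = 86400


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-05-01T12:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def recent_restarts(pod: Dict, now: datetime,
                    window: timedelta = timedelta(days=1)) -> List[Dict]:
    """Build one table row per container that restarted within `window`."""
    started = parse_k8s_time(pod["status"]["startTime"])
    days_running = (now - started).total_seconds() / SECONDS_PER_DAY
    rows = []
    for cs in pod["status"].get("containerStatuses", []):
        restarts = cs.get("restartCount", 0)
        terminated = cs.get("lastState", {}).get("terminated")
        if restarts == 0 or not terminated:
            continue  # container never restarted
        finished = parse_k8s_time(terminated["finishedAt"])
        if now - finished > window:
            continue  # last restart is older than the window
        rows.append({
            "CONTAINER": cs["name"],
            "RESTARTS": restarts,
            "DAYS": round(days_running, 2),
            # Assumption: average spacing of restarts over the pod's lifetime.
            "INTERVAL": round(days_running / restarts, 2),
            # Assumption: restart time approximated by the previous finish time.
            "LAST_RESTART": round((now - finished).total_seconds() / SECONDS_PER_DAY, 2),
            "LAST_FINISHED": terminated["finishedAt"],
            "EXIT_CODE": terminated["exitCode"],
            "REASON": terminated.get("reason", ""),
        })
    return rows


# Hypothetical sample pod and reference time.
now = datetime(2024, 5, 11, 0, 0, 0, tzinfo=timezone.utc)
sample_pod = {
    "metadata": {"name": "web-0", "namespace": "default"},
    "status": {
        "startTime": "2024-05-01T00:00:00Z",
        "containerStatuses": [{
            "name": "app",
            "restartCount": 4,
            "lastState": {"terminated": {
                "finishedAt": "2024-05-10T12:00:00Z",
                "exitCode": 137,
                "reason": "OOMKilled",
            }},
        }],
    },
}

for row in recent_restarts(sample_pod, now):
    print(row)
```

Passing `now` explicitly keeps the calculation deterministic, which makes the helper easy to exercise without a cluster.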
- test_kubernetes.test_get_version()
- Description:
Checks that a valid Kubernetes version is returned
- Result:
Shows the current Kubernetes version; fails if none is returned
- test_kubernetes.test_k8s_api_server_request_rate()
- Description:
Retrieves Kubernetes API server request rate and latency information from Prometheus
- Result:
Checks the request rate filtered for 400 and 500 error codes and flags unsuccessful requests when it is > 0.8
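The threshold check could look something like the sketch below. It rests on several assumptions that the description does not confirm: that the rates come from the `apiserver_request_total` metric grouped by its `code` label, that "400 and 500" means every 4xx and 5xx status code, and that 0.8 is compared against the summed per-second rate of those responses. The query string, helper names, and sample values are illustrative only.

```python
from typing import Dict

# Assumption: a PromQL query of this shape supplies the per-code rates.
ASSUMED_QUERY = 'sum by (code) (rate(apiserver_request_total{code=~"4..|5.."}[5m]))'

# Value taken from the description; its unit is an assumption.
ERROR_RATE_THRESHOLD = 0.8


def unsuccessful_rate(rate_by_code: Dict[str, float]) -> float:
    """Sum the rates of responses whose status code starts with 4 or 5."""
    return sum(
        rate for code, rate in rate_by_code.items()
        if code.startswith(("4", "5"))
    )


def request_rate_ok(rate_by_code: Dict[str, float],
                    threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Return True when the unsuccessful request rate is within the threshold."""
    return unsuccessful_rate(rate_by_code) <= threshold


# Hypothetical samples keyed by HTTP status code (Prometheus labels are strings).
healthy_sample = {"200": 42.0, "201": 3.5, "404": 0.25, "500": 0.25}
degraded_sample = {"200": 10.0, "503": 1.5}

print(request_rate_ok(healthy_sample), request_rate_ok(degraded_sample))
```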
- test_kubernetes.test_k8s_api_server_responding()
- Description:
Tries to query the Kubernetes API server version to ensure the API server is responding to requests.
- Result:
Fails if there is a problem connecting to or querying the API server
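A minimal sketch of this kind of probe is shown below. To keep it runnable without a cluster, the version lookup is passed in as a callable; with the official Kubernetes Python client, one candidate for that callable would be `kubernetes.client.VersionApi().get_code`, though whether the module uses it is an assumption. The helper names and the stand-in callables are hypothetical.

```python
from typing import Callable, Optional, Tuple


def api_server_responding(fetch_version: Callable[[], object]) -> Tuple[bool, Optional[str]]:
    """Call `fetch_version` and report whether the API server answered.

    Returns (True, None) on success, or (False, "<error type>: <message>").
    """
    try:
        fetch_version()
    except Exception as exc:  # any connection or query problem counts as a failure
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


def reachable() -> dict:
    """Stand-in for a working API server (hypothetical version value)."""
    return {"gitVersion": "v1.28.0"}


def unreachable() -> dict:
    """Stand-in for an API server that cannot be reached."""
    raise ConnectionError("connection refused")


print(api_server_responding(reachable))
print(api_server_responding(unreachable))
```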
- test_kubernetes.test_kernel_version()
- Description:
Checks the kernel version
- Result:
Fails if the kernel version is 3.10.0-1062.4.1.el7, which is affected by a known bug: https://bugzilla.redhat.com/show_bug.cgi?id=1507149
- Public Facing KB:
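The kernel version check described for this test can be sketched as below, assuming the version is read from each node's `status.nodeInfo.kernelVersion` field and that a prefix match is acceptable because reported kernel strings often carry an architecture suffix. Both points are assumptions, and the helper name and sample nodes are hypothetical.

```python
from typing import Dict, List

# Kernel release named in the description as affected by the known bug.
KNOWN_BAD_KERNEL = "3.10.0-1062.4.1.el7"


def nodes_with_bad_kernel(nodes: List[Dict]) -> List[str]:
    """Return the names of nodes that report the known-bad kernel release."""
    bad = []
    for node in nodes:
        kernel = node["status"]["nodeInfo"]["kernelVersion"]
        # Assumption: prefix match, so "3.10.0-1062.4.1.el7.x86_64" is caught too.
        if kernel.startswith(KNOWN_BAD_KERNEL):
            bad.append(node["metadata"]["name"])
    return bad


# Hypothetical sample nodes.
sample_nodes = [
    {"metadata": {"name": "node-a"},
     "status": {"nodeInfo": {"kernelVersion": "3.10.0-1062.4.1.el7.x86_64"}}},
    {"metadata": {"name": "node-b"},
     "status": {"nodeInfo": {"kernelVersion": "3.10.0-1160.el7.x86_64"}}},
]

print(nodes_with_bad_kernel(sample_nodes))
```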
- test_kubernetes.test_kube_dns_health()
- Description:
Retrieves the kube-dns pods that are in the Running state
- Result:
Fails if there are no kube-dns pods in the Running phase
- test_kubernetes.test_node_health()
- Description:
Checks node health. A node is considered “failed” if it meets any of these conditions:
  - The node's Ready condition has a status other than “True”. A node can be stuck in ‘pending’ while scaling, but not for more than 30 minutes.
  - The number of evicted pods is more than 10% of the total pods on a single node.
- Result:
Prints a table of failed nodes with:
  - Node name
  - Internal node IP
  - Node condition type
  - Node condition status
  - Node label
  - All the node condition statuses
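The two failure conditions for this check can be sketched as follows, working on dictionaries shaped like Node and Pod objects. This is an illustration under stated assumptions: evicted pods are recognised by `status.reason == "Evicted"`, pods are attributed to nodes through `spec.nodeName`, and the 30 minute allowance for scaling nodes is left out because the description does not say how it is measured. All helper names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def node_ready(node: Dict) -> bool:
    """Return True if the node's Ready condition has status "True"."""
    for cond in node["status"].get("conditions", []):
        if cond["type"] == "Ready":
            return cond["status"] == "True"
    return False  # no Ready condition reported at all


def evicted_ratio_by_node(pods: List[Dict]) -> Dict[str, float]:
    """Map each node name to the fraction of its pods that were evicted."""
    total: Counter = Counter()
    evicted: Counter = Counter()
    for pod in pods:
        node_name = pod["spec"].get("nodeName")
        if not node_name:
            continue  # pod was never scheduled onto a node
        total[node_name] += 1
        if pod["status"].get("reason") == "Evicted":
            evicted[node_name] += 1
    return {name: evicted[name] / total[name] for name in total}


def failed_nodes(nodes: List[Dict], pods: List[Dict],
                 max_evicted_ratio: float = 0.10) -> List[str]:
    """Return names of nodes that are not Ready or exceed the eviction ratio."""
    ratios = evicted_ratio_by_node(pods)
    return [
        node["metadata"]["name"]
        for node in nodes
        if not node_ready(node)
        or ratios.get(node["metadata"]["name"], 0.0) > max_evicted_ratio
    ]


def make_node(name: str, ready: str) -> Dict:
    return {"metadata": {"name": name},
            "status": {"conditions": [{"type": "Ready", "status": ready}]}}


def make_pods(node: str, total: int, evicted: int) -> List[Dict]:
    return [
        {"spec": {"nodeName": node},
         "status": {"phase": "Failed", "reason": "Evicted"} if i < evicted
         else {"phase": "Running"}}
        for i in range(total)
    ]


# Hypothetical cluster: node-a has 20% evicted pods, node-b is not Ready,
# node-c has exactly 10% evicted pods, which is not "more than 10%".
sample_nodes = [make_node("node-a", "True"), make_node("node-b", "False"),
                make_node("node-c", "True")]
sample_pods = make_pods("node-a", 10, 2) + make_pods("node-c", 10, 1)

print(failed_nodes(sample_nodes, sample_pods))
```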
- test_kubernetes.test_os_version()
- Description:
Checks the node operating system version
- Result:
Shows current OS version, fails if nothing is returned
- test_kubernetes.test_pod_health()
- Description:
Checks whether any pods are in the “Pending” or “Failed” status or have a terminated container. A pod is considered failed if:
  - Status is “Pending” and the pod has been pending for over 30 minutes
  - Status is “Failed”
  - A container has been “Terminated”
- Result:
Prints a table of failed pods with:
  - Pod name
  - Name of the node the pod is attached to
  - Pod IP
  - Pod namespace
  - Status reason for the failure
  - Last pod event, if the failure came from a pending pod
  - Container termination reason, if the failure is container related
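The three failure conditions for this check can be sketched as below, using dictionaries shaped like Pod objects. Assumptions: the pending duration is measured from `metadata.creationTimestamp`, and a terminated container is detected through `status.containerStatuses[].state.terminated`. The function name, return strings, and sample pods are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-05-01T12:00:00Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pod_failure_reason(pod: Dict, now: datetime,
                       pending_grace: timedelta = timedelta(minutes=30)) -> Optional[str]:
    """Return a short failure reason, or None if the pod is considered healthy."""
    status = pod["status"]
    phase = status.get("phase")
    if phase == "Failed":
        return status.get("reason", "Failed")
    if phase == "Pending":
        # Assumption: pending time is counted from pod creation.
        created = parse_k8s_time(pod["metadata"]["creationTimestamp"])
        if now - created > pending_grace:
            return "Pending for more than 30 minutes"
    for cs in status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated:
            reason = terminated.get("reason", "unknown")
            return f"Container {cs['name']} terminated: {reason}"
    return None


# Hypothetical reference time and sample pods.
now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
running = {"metadata": {"creationTimestamp": "2024-05-01T10:00:00Z"},
           "status": {"phase": "Running",
                      "containerStatuses": [{"name": "app", "state": {"running": {}}}]}}
stuck_pending = {"metadata": {"creationTimestamp": "2024-05-01T11:15:00Z"},
                 "status": {"phase": "Pending"}}
new_pending = {"metadata": {"creationTimestamp": "2024-05-01T11:55:00Z"},
               "status": {"phase": "Pending"}}
evicted = {"metadata": {"creationTimestamp": "2024-05-01T09:00:00Z"},
           "status": {"phase": "Failed", "reason": "Evicted"}}
crashed = {"metadata": {"creationTimestamp": "2024-05-01T09:00:00Z"},
           "status": {"phase": "Running",
                      "containerStatuses": [{"name": "app",
                                             "state": {"terminated": {"reason": "Error",
                                                                      "exitCode": 1}}}]}}

for pod in (running, stuck_pending, new_pending, evicted, crashed):
    print(pod_failure_reason(pod, now))
```

Taken literally, the third condition would also flag containers of completed Jobs, whose terminated state carries the reason `Completed`; whether the module filters those out is not stated.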
- test_kubernetes.versiontuple(v)