domino_admin_toolkit.checks.test_karpenter_capacity module

domino_admin_toolkit.checks.test_karpenter_capacity.hardware_tier_mapping_data()

Retrieves the hardware tier to NodePool mapping configuration from the Kubernetes API.

Returns the mapping between hardware tiers and NodePool configurations, including taints and instance types. Requires read access to Karpenter NodePools and core Node resources.
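The fixture's exact return shape is defined by the toolkit; the sketch below only illustrates the kind of query it depends on, using the official kubernetes Python client. The karpenter.sh/v1 field paths and variable names are assumptions for illustration, not the toolkit's implementation.

    from kubernetes import client, config

    config.load_kube_config()  # inside the cluster, use config.load_incluster_config()
    custom = client.CustomObjectsApi()

    # List Karpenter NodePools and pull out the pieces a tier mapping needs:
    # taints and instance-type requirements per pool.
    pools = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodepools"
    )
    for pool in pools.get("items", []):
        spec = pool["spec"]["template"]["spec"]
        taints = spec.get("taints", [])
        instance_types = [
            value
            for req in spec.get("requirements", [])
            if req.get("key") == "node.kubernetes.io/instance-type"
            for value in req.get("values", [])
        ]
        print(pool["metadata"]["name"], taints, instance_types)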

domino_admin_toolkit.checks.test_karpenter_capacity.karpenter_capacity_data()

Retrieves Karpenter NodeClaim and NodePool resources from the Kubernetes custom objects API.

Returns capacity information including requested vs actual resources and node readiness status. Requires cluster-admin permissions to access karpenter.sh/v1 custom resources.
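A hedged sketch of reading NodeClaim data of this kind, assuming the karpenter.sh/v1 NodeClaim schema (spec.resources.requests, status.capacity, status.conditions); illustrative only, not the toolkit's own code.

    from kubernetes import client, config

    config.load_kube_config()
    custom = client.CustomObjectsApi()

    # Compare requested resources against the capacity of the node that was
    # actually provisioned, and note whether the NodeClaim reports Ready.
    claims = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodeclaims"
    )
    for claim in claims.get("items", []):
        requested = claim.get("spec", {}).get("resources", {}).get("requests", {})
        capacity = claim.get("status", {}).get("capacity", {})
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in claim.get("status", {}).get("conditions", [])
        )
        print(claim["metadata"]["name"], requested, capacity,
              "ready" if ready else "not ready")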

domino_admin_toolkit.checks.test_karpenter_capacity.nodepool_usage_data()

Retrieves NodePool resource usage metrics from Karpenter custom resources.

Returns current resource utilization per NodePool for capacity monitoring. Requires read access to karpenter.sh custom resources and node metadata.
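A minimal sketch of reading per-NodePool usage, assuming Karpenter publishes consumed resources under status.resources and provisioning caps under spec.limits; illustrative only.

    from kubernetes import client, config

    config.load_kube_config()
    custom = client.CustomObjectsApi()

    # Karpenter reports the resources currently consumed by each NodePool in
    # status.resources; spec.limits caps what the pool may provision.
    pools = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodepools"
    )
    for pool in pools.get("items", []):
        name = pool["metadata"]["name"]
        usage = pool.get("status", {}).get("resources", {})
        limits = pool.get("spec", {}).get("limits", {})
        print(name, "usage:", usage, "limits:", limits)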

domino_admin_toolkit.checks.test_karpenter_capacity.test_hardware_tier_mapping(hardware_tier_mapping_data, karpenter_enabled)
Purpose:

Validates proper mapping between hardware tiers and Karpenter NodePools for workload scheduling. Also identifies NodePools that contain nodes originally provisioned by legacy infrastructure.

Data Interpretation:
  • “Has Migrated Nodes” = True: the NodePool contains nodes originally created by legacy infrastructure (EKS NodeGroups, Cluster Autoscaler ASGs, eksctl) that are now managed by Karpenter (a migration scenario)

  • “Has Migrated Nodes” = False: all nodes were provisioned natively by Karpenter (clean deployment)

Migration Sources:
  • EKS-NG: Traditional EKS Managed NodeGroups

  • CA-ASG: Cluster Autoscaler Auto Scaling Groups

  • eksctl: eksctl-managed NodeGroups

Failure Conditions:
  • GPU NodePools lack nvidia.com/gpu taints for proper scheduling

  • Neuron NodePools lack aws.amazon.com/neuron taints for proper scheduling (a taint-check sketch follows this list)
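
A hedged sketch of such a taint check, assuming NodePool taints and instance-type requirements live under spec.template.spec; the accelerator instance-family prefixes are a rough heuristic chosen for illustration, not the toolkit's own logic.

    from kubernetes import client, config

    # Rough heuristic: EC2 instance families that usually carry accelerators.
    # These prefix lists are assumptions for illustration only.
    GPU_PREFIXES = ("p", "g")
    NEURON_PREFIXES = ("inf", "trn")

    config.load_kube_config()
    custom = client.CustomObjectsApi()
    pools = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodepools"
    )
    for pool in pools.get("items", []):
        spec = pool["spec"]["template"]["spec"]
        taint_keys = {t.get("key") for t in spec.get("taints", [])}
        instance_types = [
            value
            for req in spec.get("requirements", [])
            if req.get("key") == "node.kubernetes.io/instance-type"
            for value in req.get("values", [])
        ]
        name = pool["metadata"]["name"]
        if any(t.startswith(GPU_PREFIXES) for t in instance_types) and "nvidia.com/gpu" not in taint_keys:
            print(f"{name}: GPU instance types but no nvidia.com/gpu taint")
        if any(t.startswith(NEURON_PREFIXES) for t in instance_types) and "aws.amazon.com/neuron" not in taint_keys:
            print(f"{name}: Neuron instance types but no aws.amazon.com/neuron taint")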

Troubleshooting Steps:
  1. Check NodePool taint configuration via kubectl get nodepools -o yaml

  2. Verify instance type requirements via AWS console or CLI

  3. Review workload scheduling patterns via kubectl describe pods

  4. For migrated nodes, verify node labels with kubectl get nodes --show-labels (see the label-check sketch after this list)

  5. Check legacy infrastructure cleanup via kubectl get nodes -l eks.amazonaws.com/nodegroup
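
A hedged sketch of the same label check done programmatically, assuming the standard karpenter.sh/nodepool, eks.amazonaws.com/nodegroup, and alpha.eksctl.io/nodegroup-name label keys; adjust the legacy keys for your environment.

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Label keys that indicate how a node was provisioned.
    KARPENTER_LABEL = "karpenter.sh/nodepool"
    LEGACY_LABELS = ("eks.amazonaws.com/nodegroup", "alpha.eksctl.io/nodegroup-name")

    for node in core.list_node().items:
        labels = node.metadata.labels or {}
        pool = labels.get(KARPENTER_LABEL)
        legacy = [key for key in LEGACY_LABELS if key in labels]
        if pool and legacy:
            print(f"{node.metadata.name}: in NodePool {pool}, migrated from {legacy}")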

Resolution Guidance:
  1. For missing hardware tiers, create NodePools with appropriate instance types

  2. Add required taints to specialized hardware NodePools for workload isolation (a patch sketch follows this list)

  3. Update workload tolerations to match NodePool taints

  4. For migration scenarios, ensure proper coordination between Karpenter and legacy infrastructure

  5. Plan phased migration from Cluster Autoscaler/EKS NodeGroups to native Karpenter
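
A hedged sketch of adding a taint with the kubernetes Python client; the NodePool name gpu-small and the taint value are hypothetical, and a JSON merge patch replaces the entire taints list, so any existing taints must be included.

    from kubernetes import client, config

    config.load_kube_config()
    custom = client.CustomObjectsApi()

    # Merge-patch the NodePool template so newly provisioned GPU nodes carry
    # the nvidia.com/gpu taint. Include any existing taints in the list,
    # because a merge patch replaces the whole array.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "taints": [
                        {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
                    ]
                }
            }
        }
    }
    custom.patch_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodepools",
        name="gpu-small", body=patch,
    )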

domino_admin_toolkit.checks.test_karpenter_capacity.test_karpenter_nodeclaim_resource_efficiency(karpenter_capacity_data, karpenter_enabled)
Purpose:

Validates Karpenter NodeClaim resource efficiency and node provisioning health.

Failure Conditions:
  • CPU or memory efficiency below 1% (extremely wasteful)

  • Overprovisioning ratios exceed 100x (cost-control concern; see the ratio sketch after this list)

  • NodeClaims remain in non-ready state after provisioning
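
A minimal sketch of the ratio math these conditions describe, assuming the requested and capacity values come from NodeClaim spec.resources.requests and status.capacity; the quantity parser is deliberately simplified and the function names are illustrative.

    # Simplified CPU quantity parser: handles plain cores ("96") and millicores ("250m") only.
    def parse_cpu(quantity: str) -> float:
        return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

    def check_cpu_efficiency(requested: str, capacity: str) -> list[str]:
        req, cap = parse_cpu(requested), parse_cpu(capacity)
        findings = []
        if cap > 0 and req / cap < 0.01:   # efficiency below 1%
            findings.append(f"CPU efficiency {req / cap:.2%} is below 1%")
        if req > 0 and cap / req > 100:    # more than 100x overprovisioned
            findings.append(f"CPU overprovisioning ratio {cap / req:.0f}x exceeds 100x")
        return findings

    # Example: a 96-core node provisioned for a 250m CPU request trips both conditions.
    print(check_cpu_efficiency("250m", "96"))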

Troubleshooting Steps:
  1. Check Karpenter controller logs via kubectl logs -n karpenter deployment/karpenter (adjust the namespace and deployment name to your installation)

  2. Verify NodePool requirements match available instance types via kubectl get nodepools

  3. Examine NodeClaim status conditions via kubectl describe nodeclaims

Resolution Guidance:
  1. For consistent over-provisioning, tighten the NodePool requirement constraints to narrow instance selection

  2. If instance types are unavailable, expand the allowed instance type list

  3. For provisioning failures, check IAM permissions and EC2 service limits

domino_admin_toolkit.checks.test_karpenter_capacity.test_nodepool_capacity_utilization(nodepool_usage_data, karpenter_enabled)
Purpose:

Monitors NodePool capacity utilization to prevent job scheduling failures due to resource exhaustion.

Failure Conditions:
  • CPU utilization exceeds 80% warning threshold

  • Memory utilization exceeds 80% warning threshold

  • CPU or memory utilization exceeds 90% critical threshold

  • NodePool limits are set to zero or undefined (a threshold sketch follows this list)
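
A minimal sketch of the threshold logic, assuming usage is read from NodePool status.resources and limits from spec.limits (CPU shown; memory is analogous); the helper name is illustrative.

    WARNING_PCT, CRITICAL_PCT = 80.0, 90.0

    def check_cpu_utilization(usage_cores, limit_cores):
        """Return a status string for one NodePool's CPU dimension."""
        if not limit_cores:  # covers both a zero and an undefined (None) limit
            return "FAIL: NodePool CPU limit is zero or undefined"
        pct = usage_cores / limit_cores * 100
        if pct > CRITICAL_PCT:
            return f"CRITICAL: CPU utilization {pct:.1f}% exceeds {CRITICAL_PCT}%"
        if pct > WARNING_PCT:
            return f"WARNING: CPU utilization {pct:.1f}% exceeds {WARNING_PCT}%"
        return f"OK: CPU utilization {pct:.1f}%"

    print(check_cpu_utilization(850.0, 1000.0))  # WARNING at 85%
    print(check_cpu_utilization(950.0, 1000.0))  # CRITICAL at 95%
    print(check_cpu_utilization(10.0, 0))        # FAIL: limit not set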

Troubleshooting Steps:
  1. Check NodePool specifications via kubectl get nodepools -o yaml

  2. Compare actual node usage via kubectl top nodes against allocated workload resource requests via kubectl describe nodes

  3. Monitor job scheduling failures via kubectl get events --field-selector type=Warning

Resolution Guidance:
  1. For high utilization, increase NodePool resource limits in cluster configuration

  2. Review workload resource requests to ensure they match actual usage patterns