domino_admin_toolkit.checks.test_karpenter_capacity module
- domino_admin_toolkit.checks.test_karpenter_capacity.hardware_tier_mapping_data()
Retrieves hardware tier to NodePool mapping configuration from Kubernetes API.
Returns the mapping between hardware tiers and NodePool configurations, including taints and instance types. Requires read access to Karpenter NodePools and core Node resources.
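The sketch below shows the kind of query such a fixture performs, using the official `kubernetes` Python client; the function name and the selection of fields are illustrative assumptions, not the toolkit's actual implementation.

```python
# Minimal sketch, assuming the official `kubernetes` client is installed and
# a kubeconfig is available; names are illustrative, not toolkit code.
from kubernetes import client, config

def fetch_nodepool_mapping():
    config.load_kube_config()  # use load_incluster_config() inside a pod
    custom = client.CustomObjectsApi()
    # Karpenter NodePools are cluster-scoped custom resources.
    nodepools = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodepools"
    )["items"]
    mapping = {}
    for np in nodepools:
        template_spec = np["spec"]["template"]["spec"]
        mapping[np["metadata"]["name"]] = {
            "taints": template_spec.get("taints", []),
            "requirements": template_spec.get("requirements", []),
        }
    return mapping
```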
- domino_admin_toolkit.checks.test_karpenter_capacity.karpenter_capacity_data()
Retrieves Karpenter NodeClaim and NodePool resources from the Kubernetes custom objects API.
Returns capacity information including requested vs actual resources and node readiness status. Requires cluster-admin permissions to access karpenter.sh/v1 custom resources.
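A hedged sketch of reading NodeClaims with the same client follows; the requested and capacity fields mirror what Karpenter reports on each NodeClaim, while the row shape is an assumption.

```python
from kubernetes import client, config

def fetch_nodeclaim_capacity():
    """Sketch: list NodeClaims and summarize requested vs. actual capacity."""
    config.load_kube_config()
    custom = client.CustomObjectsApi()
    nodeclaims = custom.list_cluster_custom_object(
        group="karpenter.sh", version="v1", plural="nodeclaims"
    )["items"]
    rows = []
    for nc in nodeclaims:
        status = nc.get("status", {})
        conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
        rows.append({
            "name": nc["metadata"]["name"],
            "requested": nc["spec"].get("resources", {}).get("requests", {}),
            "capacity": status.get("capacity", {}),
            "ready": conditions.get("Ready") == "True",
        })
    return rows
```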
- domino_admin_toolkit.checks.test_karpenter_capacity.nodepool_usage_data()
Retrieves NodePool resource usage metrics from Karpenter custom resources.
Returns current resource utilization per NodePool for capacity monitoring. Requires read access to karpenter.sh custom resources and node metadata.
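One plausible way to aggregate usage per NodePool is to group node allocatable capacity by Karpenter's karpenter.sh/nodepool node label, as sketched below; the aggregation shape is an assumption.

```python
from collections import defaultdict
from kubernetes import client, config

def nodepool_cpu_totals():
    """Sketch: sum allocatable CPU cores per NodePool via node labels."""
    config.load_kube_config()
    core = client.CoreV1Api()
    totals = defaultdict(float)
    for node in core.list_node().items:
        pool = (node.metadata.labels or {}).get("karpenter.sh/nodepool")
        if pool is None:
            continue  # node not managed by Karpenter
        cpu = node.status.allocatable.get("cpu", "0")  # e.g. "3920m" or "4"
        cores = float(cpu[:-1]) / 1000 if cpu.endswith("m") else float(cpu)
        totals[pool] += cores
    # Compare each total against the corresponding NodePool's spec.limits.cpu.
    return dict(totals)
```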
- domino_admin_toolkit.checks.test_karpenter_capacity.test_hardware_tier_mapping(hardware_tier_mapping_data, karpenter_enabled)
- Purpose:
Validates proper mapping between hardware tiers and Karpenter NodePools for workload scheduling. Also identifies NodePools that contain nodes originally provisioned by legacy infrastructure.
- Data Interpretation:
"Has Migrated Nodes" = True: the NodePool contains nodes that were originally created by legacy infrastructure (EKS NodeGroups, Cluster Autoscaler ASGs, eksctl) but are now managed by Karpenter, indicating a migration scenario
"Has Migrated Nodes" = False: all nodes were provisioned natively by Karpenter (clean deployment)
- Migration Sources:
EKS-NG: Traditional EKS Managed NodeGroups
CA-ASG: Cluster Autoscaler Auto Scaling Groups
eksctl: eksctl-managed NodeGroups
- Failure Conditions:
GPU NodePools lack the nvidia.com/gpu taint required for proper scheduling
Neuron NodePools lack the aws.amazon.com/neuron taint required for proper scheduling
- Troubleshooting Steps:
Check NodePool taint configuration via kubectl get nodepools -o yaml
Verify instance type requirements via AWS console or CLI
Review workload scheduling patterns via kubectl describe pods
For migrated nodes, verify node labels with kubectl get nodes --show-labels
Check legacy infrastructure cleanup via kubectl get nodes -l eks.amazonaws.com/nodegroup
- Resolution Guidance:
For missing hardware tiers, create NodePools with appropriate instance types
Add required taints to specialized hardware NodePools for workload isolation (see the taint-validation sketch after this list)
Update workload tolerations to match NodePool taints
For migration scenarios, ensure Karpenter and the legacy autoscaler are not competing to manage the same nodes
Plan phased migration from Cluster Autoscaler/EKS NodeGroups to native Karpenter
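A minimal sketch of the taint check described above, assuming GPU and Neuron pools can be recognized by name; the name-based detection heuristic is an assumption, not toolkit behavior, so adapt it to your naming scheme.

```python
def missing_accelerator_taints(nodepools):
    """Sketch: flag GPU/Neuron NodePools missing their scheduling taints.

    nodepools: list of NodePool dicts from the karpenter.sh/v1 API.
    Detecting accelerator pools by name is a hypothetical heuristic.
    """
    required = {"gpu": "nvidia.com/gpu", "neuron": "aws.amazon.com/neuron"}
    problems = []
    for np in nodepools:
        name = np["metadata"]["name"]
        taints = {t["key"] for t in np["spec"]["template"]["spec"].get("taints", [])}
        for marker, taint_key in required.items():
            if marker in name.lower() and taint_key not in taints:
                problems.append((name, taint_key))
    return problems
```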
- domino_admin_toolkit.checks.test_karpenter_capacity.test_karpenter_nodeclaim_resource_efficiency(karpenter_capacity_data, karpenter_enabled)
- Purpose:
Validates Karpenter NodeClaim resource efficiency and node provisioning health
- Failure Conditions:
CPU or memory efficiency below 1%, meaning the node provides far more capacity than workloads requested (extremely wasteful)
Overprovisioning ratio exceeds 100x (cost control; both figures are illustrated in the sketch after this section)
NodeClaims remain in a non-ready state after provisioning
- Troubleshooting Steps:
Check Karpenter controller logs via kubectl logs -n karpenter deployment/karpenter (adjust the namespace and deployment name to your installation)
Verify NodePool requirements match available instance types via kubectl get nodepools
Examine NodeClaim status conditions via kubectl describe nodeclaims
- Resolution Guidance:
For consistent over-provisioning, adjust NodePool requirements constraints
If instance types are unavailable, expand the allowed instance type list
For provisioning failures, check IAM permissions and EC2 service limits
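For concreteness, here is how the efficiency and overprovisioning figures above can be computed; the quantity parser is a simplified assumption that handles only plain and milli-CPU forms.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m' or '2') to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def cpu_efficiency(requested: str, capacity: str) -> float:
    """Fraction of a node's CPU capacity actually requested by workloads."""
    return parse_cpu(requested) / parse_cpu(capacity)

# Example: a NodeClaim requesting 100m CPU that landed on a 16-core node.
eff = cpu_efficiency("100m", "16")  # 0.00625 -> below the 1% efficiency floor
overprovision = 1 / eff             # 160x    -> above the 100x ceiling
```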
- domino_admin_toolkit.checks.test_karpenter_capacity.test_nodepool_capacity_utilization(nodepool_usage_data, karpenter_enabled)
- Purpose:
Monitors NodePool capacity utilization to prevent job scheduling failures due to resource exhaustion
- Failure Conditions:
CPU utilization exceeds 80% warning threshold
Memory utilization exceeds 80% warning threshold
CPU or memory utilization exceeds 90% critical threshold
NodePool limits are set to zero or undefined (the threshold logic is sketched after this section)
- Troubleshooting Steps:
Check NodePool specifications via kubectl get nodepools -o yaml
Verify current workload resource requests via kubectl describe nodes (the Allocated resources section); kubectl top nodes reports live usage rather than requests
Monitor job scheduling failures via kubectl get events --field-selector type=Warning
- Resolution Guidance:
For high utilization, increase NodePool resource limits in cluster configuration
Review workload resource requests to ensure they match actual usage patterns
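The threshold logic above reduces to a simple classification, sketched here; the function name and return values are illustrative assumptions.

```python
WARNING_THRESHOLD, CRITICAL_THRESHOLD = 0.80, 0.90

def classify_utilization(used: float, limit: float) -> str:
    """Sketch: map a usage/limit ratio onto the check's severity levels."""
    if limit <= 0:
        return "misconfigured"  # zero or undefined limits fail outright
    ratio = used / limit
    if ratio >= CRITICAL_THRESHOLD:
        return "critical"
    if ratio >= WARNING_THRESHOLD:
        return "warning"
    return "ok"

print(classify_utilization(used=34.0, limit=40.0))  # "warning" (85%)
```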