domino_admin_toolkit.checks.test_aws_cloud module
- pydantic model domino_admin_toolkit.checks.test_aws_cloud.ASGNeuronTagAnalyzer
Bases: AnalyzerBase
Validates that Auto Scaling Groups have the required neuron tag
- Fields:
- field required_neuron_tag: str = 'k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron'
The required neuron resource tag
- analyze(data)
Core analysis logic
- Return type:
- name: ClassVar[str] = 'ASGNeuronTagAnalyzer'
- domino_admin_toolkit.checks.test_aws_cloud.get_asg_data(aws_region)
Retrieves Auto Scaling Group data from AWS
- Return type:
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_hwt_matches(aws_region, skip_karpenter_enabled)
- Return type:
- Description:
Checks that all hardware tiers have a corresponding autoscaling group with the right node-pool tag. Not applicable to Karpenter-enabled deployments.
- Result:
The toolkit check displays all hardware tiers and their corresponding autoscaling groups.
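The matching described above can be sketched roughly as follows. This is an illustration, not the toolkit's implementation: the hardware tier shape (`id`, `node_pool`) and both function names are hypothetical, the ASG dictionaries are assumed to follow the shape returned by the AWS `describe_auto_scaling_groups` call, and the node-pool tag key is assumed to be the `dominodatalab.com/node-pool` node-template label tag listed elsewhere in this module.

```python
from collections import defaultdict

# Assumption: the "node-pool tag" is the node-template label tag below.
NODE_POOL_TAG = (
    "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool"
)


def match_tiers_to_asgs(hardware_tiers, asgs):
    """Map each hardware tier id to the ASG names tagged with its node pool."""
    asgs_by_pool = defaultdict(list)
    for asg in asgs:
        for tag in asg.get("Tags", []):
            if tag.get("Key") == NODE_POOL_TAG:
                asgs_by_pool[tag.get("Value")].append(asg["AutoScalingGroupName"])
    # Hypothetical tier shape: {"id": ..., "node_pool": ...}
    return {
        tier["id"]: asgs_by_pool.get(tier["node_pool"], [])
        for tier in hardware_tiers
    }


def unmatched_tiers(hardware_tiers, asgs):
    """Return the ids of hardware tiers that have no matching ASG."""
    matches = match_tiers_to_asgs(hardware_tiers, asgs)
    return sorted(tier_id for tier_id, names in matches.items() if not names)
```

A tier that appears in the result of `unmatched_tiers` has no autoscaling group carrying its node-pool value, which is the condition this check is described as looking for.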
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_neuron_tags(asg_data, skip_karpenter_enabled)
- Return type:
- Description:
Checks that all Trainium and Inferentia Auto Scaling Groups (ASGs) have the required neuron resource tag for proper cluster autoscaling of ML workloads. Not applicable to Karpenter-enabled deployments.
Trainium (trn1) and Inferentia (inf1, inf2) instances require the neuron tag so that the cluster autoscaler knows these nodes provide AWS Neuron resources for ML training and inference workloads.
Required tag for Trainium/Inferentia ASGs:
k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron
- Result:
The toolkit check displays all Trainium/Inferentia ASGs and validates they have the required neuron resource tag. Fails if any are missing the tag.
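As a rough sketch of the neuron tag validation, the snippet below filters ASGs by instance family and reports those without the tag. The documentation does not say how the toolkit decides that an ASG is Trainium or Inferentia; reading instance types from the `Instances` list and from mixed-instances overrides is an assumption, as are all function names. ASG dictionaries are assumed to follow the `describe_auto_scaling_groups` response shape.

```python
NEURON_TAG = (
    "k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron"
)
# Instance families named in the description above.
NEURON_FAMILIES = ("trn1", "inf1", "inf2")


def instance_types(asg):
    """Collect instance types from running instances and policy overrides.

    Assumption: these two places are enough to identify the ASG's family.
    """
    types = {
        inst["InstanceType"]
        for inst in asg.get("Instances", [])
        if "InstanceType" in inst
    }
    overrides = (
        asg.get("MixedInstancesPolicy", {})
        .get("LaunchTemplate", {})
        .get("Overrides", [])
    )
    types.update(o["InstanceType"] for o in overrides if "InstanceType" in o)
    return types


def is_neuron_asg(asg):
    """True if any instance type belongs to a Trainium/Inferentia family."""
    return any(t.split(".")[0] in NEURON_FAMILIES for t in instance_types(asg))


def asgs_missing_neuron_tag(asgs):
    """Return names of Trainium/Inferentia ASGs that lack the neuron tag."""
    missing = []
    for asg in asgs:
        keys = {tag.get("Key") for tag in asg.get("Tags", [])}
        if is_neuron_asg(asg) and NEURON_TAG not in keys:
            missing.append(asg["AutoScalingGroupName"])
    return missing
```

An empty result corresponds to a passing check; any returned name corresponds to the failure case described above.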
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_tags(aws_region, skip_karpenter_enabled)
- Return type:
None
- Description:
The cluster autoscaler depends on Auto Scaling Group (ASG) tags to discover and scale nodes. If any tags are missing, review the ASG tags, as this could be the cause of scaling problems. Not applicable to Karpenter-enabled deployments.
Required tags in AWS:
k8s.io/cluster-autoscaler/enabled
k8s.io/cluster-autoscaler/{cluster name}
Required tags in EKS (1.24+) if scaling up “from 0 nodes”:
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone = <AWS AZs>
k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node
k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool
Optional tags:
k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse
Note:
For EKS managed node groups, some tag keys, such as “k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node”, are generated from the nodes’ labels via their kubelet configuration. These generated values override the tag values set in the autoscaling group, so explicitly setting these tags in the autoscaling group is unnecessary.
Result:
The toolkit check displays all Auto Scaling Groups (ASGs) and validates that each one has the required tags.
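The tag requirements listed above can be expressed as a small lookup. This is a minimal sketch under stated assumptions: the function names and the `scale_from_zero` flag are hypothetical, only tag keys (not values) are compared, and ASG dictionaries are assumed to follow the `describe_auto_scaling_groups` response shape.

```python
PREFIX = "k8s.io/cluster-autoscaler"

# Tags listed above as required when scaling up from 0 nodes on EKS 1.24+.
SCALE_FROM_ZERO_TAGS = (
    f"{PREFIX}/node-template/label/topology.ebs.csi.aws.com/zone",
    f"{PREFIX}/node-template/label/dominodatalab.com/domino-node",
    f"{PREFIX}/node-template/label/dominodatalab.com/node-pool",
)


def required_tags(cluster_name, scale_from_zero=False):
    """Build the list of required tag keys for a given cluster name."""
    tags = [f"{PREFIX}/enabled", f"{PREFIX}/{cluster_name}"]
    if scale_from_zero:
        tags.extend(SCALE_FROM_ZERO_TAGS)
    return tags


def missing_tags(asg, cluster_name, scale_from_zero=False):
    """Return the required tag keys that are absent from one ASG."""
    present = {tag.get("Key") for tag in asg.get("Tags", [])}
    return [
        key
        for key in required_tags(cluster_name, scale_from_zero)
        if key not in present
    ]
```

The sketch does not model the EKS managed node group exception from the note above, so for those groups a reported missing node-template label tag may be harmless.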
- domino_admin_toolkit.checks.test_aws_cloud.test_unattached_ebs_volumes(aws_region)
- Return type:
- Description:
Checks whether there are available, unattached EBS volumes on AWS.
- Result:
Shows detailed information about available volumes larger than AVAILABLE_EBS_VOLUME_THRESHOLD GiB. This threshold is currently set to 100 GiB. A volume is considered available if it doesn’t have the ‘Attachments’ key and its state is ‘available’.
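The availability rule and size threshold can be sketched as a filter over volume records. The function names are hypothetical, and the volume dictionaries are assumed to follow the AWS `describe_volumes` response shape, with `Size` in GiB. Two further assumptions go beyond the description: an empty `Attachments` list is treated the same as a missing key, and the size comparison is strictly greater than the threshold.

```python
# Threshold in GiB, using the value stated in the description.
AVAILABLE_EBS_VOLUME_THRESHOLD = 100


def is_available(volume):
    """True if the volume has no attachments and its state is 'available'.

    Assumption: an empty 'Attachments' list counts as unattached.
    """
    return not volume.get("Attachments") and volume.get("State") == "available"


def large_available_volumes(volumes, threshold_gib=AVAILABLE_EBS_VOLUME_THRESHOLD):
    """Return available volumes whose size exceeds the threshold."""
    return [
        v for v in volumes
        if is_available(v) and v.get("Size", 0) > threshold_gib
    ]
```

Passing a different `threshold_gib` makes it easy to see how many volumes would be reported at other cut-offs.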