domino_admin_toolkit.checks.test_aws_cloud module

pydantic model domino_admin_toolkit.checks.test_aws_cloud.ASGNeuronTagAnalyzer

Bases: AnalyzerBase

Validates that Auto Scaling Groups have the required neuron tag

Fields:

required_neuron_tag (str)

field required_neuron_tag: str = 'k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron': The required neuron resource tag

analyze(data)

Core analysis logic

Return type:: list[CheckResult]

name: ClassVar[str] = 'ASGNeuronTagAnalyzer'

domino_admin_toolkit.checks.test_aws_cloud.asg_data(aws_region)

Return type:: DataFrame

domino_admin_toolkit.checks.test_aws_cloud.get_asg_data(aws_region)

Retrieves Auto Scaling Group data from AWS

Return type:: DataFrame

domino_admin_toolkit.checks.test_aws_cloud.test_asg_hwt_matches(domino_api_client, k8s_client, aws_region, skip_karpenter_enabled)

Return type:: None

Description:: Checks that all hardware tiers have a corresponding autoscaling group with the right node-pool tag. Not applicable to Karpenter-enabled deployments.
Result:: The toolkit check displays all hardware tiers and their corresponding autoscaling groups.
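
The matching described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: it assumes ASG records shaped like the entries returned by the AWS API (an `AutoScalingGroupName` plus a `Tags` list of `Key`/`Value` pairs), and a hypothetical `tier_pools` mapping of hardware tier name to node pool name. The helper names are invented for this example.

```python
# Tag key that carries the node pool on an ASG (taken from the tag list in this module's docs).
NODE_POOL_TAG = "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool"


def asg_node_pools(asgs):
    """Map each node-pool tag value to the names of the ASGs that carry it."""
    pools = {}
    for asg in asgs:
        for tag in asg.get("Tags", []):
            if tag.get("Key") == NODE_POOL_TAG:
                pools.setdefault(tag["Value"], []).append(asg["AutoScalingGroupName"])
    return pools


def unmatched_hardware_tiers(tier_pools, asgs):
    """Return the hardware tier names whose node pool has no tagged ASG.

    `tier_pools` is a hypothetical {tier name: node pool name} mapping.
    """
    pools = asg_node_pools(asgs)
    return sorted(name for name, pool in tier_pools.items() if pool not in pools)
```

A non-empty result from `unmatched_hardware_tiers` would correspond to hardware tiers that no ASG can serve.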

domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_neuron_tags(asg_data, skip_karpenter_enabled)

Return type:: None

Description:

Checks that all Trainium and Inferentia Auto Scaling Groups (ASGs) have the required neuron resource tag for proper cluster autoscaling of ML workloads. Not applicable to Karpenter-enabled deployments.

Trainium (trn1) and Inferentia (inf1, inf2) instances require the neuron tag so that the cluster autoscaler knows these nodes provide AWS Neuron resources for ML training and inference workloads.

Required tag for Trainium/Inferentia ASGs:

k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron

Result:

The toolkit check displays all Trainium/Inferentia ASGs and validates they have the required neuron resource tag. Fails if any are missing the tag.
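
As a rough sketch of the rule above, the check can be thought of as a filter over ASG records. The record layout here (`name`, `instance_types`, `tags`) is an assumption made for illustration; the actual columns of the `asg_data` DataFrame are not documented in this section, and the function names are hypothetical.

```python
# Required tag key, as documented for ASGNeuronTagAnalyzer.required_neuron_tag.
NEURON_TAG = "k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron"

# Instance family prefixes named in the description above.
NEURON_PREFIXES = ("trn1", "inf1", "inf2")


def is_neuron_instance(instance_type):
    """Return True if the instance type belongs to a Trainium or Inferentia family."""
    family = instance_type.split(".", 1)[0]
    return family.startswith(NEURON_PREFIXES)


def asgs_missing_neuron_tag(asgs):
    """Return names of Trainium/Inferentia ASGs that lack the neuron resource tag.

    Each ASG is assumed to be a dict with 'name', 'instance_types' (list of str)
    and 'tags' (dict of tag key to value).
    """
    missing = []
    for asg in asgs:
        uses_neuron = any(is_neuron_instance(t) for t in asg["instance_types"])
        if uses_neuron and NEURON_TAG not in asg["tags"]:
            missing.append(asg["name"])
    return missing
```

ASGs that do not use Trainium or Inferentia instance types are ignored by this filter, which mirrors the scope stated in the description.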

domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_tags(domino_api_client, aws_region, skip_karpenter_enabled, k8s_client)

Return type:: None

Description:

The cluster autoscaler depends on Auto Scaling Group (ASG) tags to discover and scale nodes. If any tags are missing, review the ASG tags, as this could be the cause of scaling problems. Not applicable to Karpenter-enabled deployments.

Required tags in AWS:

k8s.io/cluster-autoscaler/enabled

k8s.io/cluster-autoscaler/{cluster name}

Required tags in EKS (1.24+) if scaling up “from 0 nodes”:

k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone = <AWS AZs>

k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node

k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool

Optional tags:

k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse

Note:

For EKS managed nodegroups, specific tag keys such as “k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node” are generated from the nodes’ labels via their kubelet configuration. These generated values override the tag values set in the autoscaling group, so explicitly setting these tags in the autoscaling group is unnecessary.

Result:

The toolkit check displays all Auto Scaling Groups (ASGs), while also validating them for required tags.

Public Facing KB:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md
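
The tag requirements listed for this check can be expressed as a small helper. This is an illustrative sketch only: the function name and its `scale_from_zero` flag are hypothetical, and it does not model the EKS managed nodegroup exception described in the note.

```python
LABEL_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def missing_required_tags(tag_keys, cluster_name, scale_from_zero=False):
    """Return the required cluster-autoscaler tag keys absent from `tag_keys`.

    `tag_keys` is any collection of the tag keys present on one ASG.
    """
    required = [
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    ]
    if scale_from_zero:
        # Extra keys listed for EKS (1.24+) when scaling up from 0 nodes.
        required += [
            LABEL_PREFIX + "topology.ebs.csi.aws.com/zone",
            LABEL_PREFIX + "dominodatalab.com/domino-node",
            LABEL_PREFIX + "dominodatalab.com/node-pool",
        ]
    present = set(tag_keys)
    return [key for key in required if key not in present]
```

An empty list means the ASG carries every required key for the chosen scenario; the optional `smarter-devices/fuse` tag is deliberately not checked.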

domino_admin_toolkit.checks.test_aws_cloud.test_unattached_ebs_volumes(aws_region, k8s_client)

Return type:: None

Description:: Checks for available, unattached EBS volumes in AWS.
Result:: Shows detailed information about available volumes larger than AVAILABLE_EBS_VOLUME_THRESHOLD, which is currently set to 100 GiB. A volume is considered available if it doesn’t have the ‘Attachments’ key and its state is ‘available’.
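
The availability rule above can be sketched as a filter over volume records shaped like the `Volumes` entries of the EC2 `describe_volumes` response (`VolumeId`, `Size` in GiB, `State`, `Attachments`). The function name is hypothetical, and treating an empty `Attachments` list the same as a missing key is an assumption of this sketch rather than documented toolkit behavior.

```python
# Threshold in GiB, matching the value stated in the description above.
AVAILABLE_EBS_VOLUME_THRESHOLD = 100


def large_unattached_volumes(volumes, threshold_gib=AVAILABLE_EBS_VOLUME_THRESHOLD):
    """Return volumes that are available, unattached, and larger than the threshold."""
    return [
        vol
        for vol in volumes
        if vol.get("State") == "available"
        and not vol.get("Attachments")  # missing key or empty list (assumption)
        and vol.get("Size", 0) > threshold_gib
    ]
```

Volumes at or below the threshold are left out, so a 100 GiB volume would not be reported under this reading of "over" the threshold.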