domino_admin_toolkit.checks.test_aws_cloud module
- pydantic model domino_admin_toolkit.checks.test_aws_cloud.ASGNeuronTagAnalyzer
Bases: AnalyzerBase
Validates that Auto Scaling Groups have the required neuron tag
- Fields:
field required_neuron_tag: str = 'k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron'
The required neuron resource tag
- analyze(data)
Core analysis logic
- Return type:
- name: ClassVar[str] = 'ASGNeuronTagAnalyzer'
- pydantic model domino_admin_toolkit.checks.test_aws_cloud.EBSVolumeAnalyzer
Bases: AnalyzerBase
Analyzes unattached EBS volumes for cost optimization opportunities
- analyze(data)
Analyze EBS volume data for cost optimization
- Return type:
- name: ClassVar[str] = 'EBSVolumeAnalyzer'
- domino_admin_toolkit.checks.test_aws_cloud.ebs_volume_data(aws_region, k8s_client)
Collect EBS volume data
- Return type:
- domino_admin_toolkit.checks.test_aws_cloud.get_asg_data(aws_region)
Retrieves Auto Scaling Group data from AWS
- Return type:
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_hwt_matches(domino_api_client, k8s_client, aws_region, skip_karpenter_enabled)
- Return type:
- Description:
Checks that all hardware tiers have a corresponding autoscaling group with the right node-pool tag. Not applicable to Karpenter-enabled deployments.
- Result:
The toolkit check displays all hardware tiers and their corresponding autoscaling groups.
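The matching described above can be sketched roughly as follows. This is not the toolkit's implementation: the function name `match_tiers_to_asgs` and the hardware tier shape (`id`, `node_pool`) are hypothetical, and the ASG dictionaries are assumed to follow the shape returned by the AWS `describe_auto_scaling_groups` call (`AutoScalingGroupName`, `Tags` with `Key`/`Value`). The node-pool tag key is taken from the required tags documented for `test_asg_missing_tags`.

```python
# Tag key that carries the node pool of an ASG (from the required tags in this module).
NODE_POOL_TAG = (
    "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool"
)


def match_tiers_to_asgs(hardware_tiers, asgs):
    """Map each hardware tier ID to the names of ASGs tagged with its node pool.

    hardware_tiers: list of dicts with hypothetical keys "id" and "node_pool".
    asgs: list of dicts shaped like describe_auto_scaling_groups results.
    """
    # Index ASG names by the value of their node-pool tag.
    asgs_by_pool = {}
    for asg in asgs:
        for tag in asg.get("Tags", []):
            if tag.get("Key") == NODE_POOL_TAG:
                asgs_by_pool.setdefault(tag.get("Value"), []).append(
                    asg["AutoScalingGroupName"]
                )
    # A tier with an empty list has no corresponding ASG.
    return {
        tier["id"]: asgs_by_pool.get(tier["node_pool"], [])
        for tier in hardware_tiers
    }
```

A tier that maps to an empty list is one that has no autoscaling group with the matching node-pool tag.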
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_neuron_tags(asg_data, skip_karpenter_enabled)
- Return type:
- Description:
Checks that all Trainium and Inferentia Auto Scaling Groups (ASGs) have the required neuron resource tag for proper cluster autoscaling of ML workloads. Not applicable to Karpenter-enabled deployments.
Trainium (trn1) and Inferentia (inf1, inf2) instances require the neuron tag so that the cluster autoscaler knows these nodes provide AWS Neuron resources for ML training and inference workloads.
Required tag for Trainium/Inferentia ASGs:
k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron
- Result:
The toolkit check displays all Trainium/Inferentia ASGs and validates they have the required neuron resource tag. Fails if any are missing the tag.
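As a minimal sketch of the check described above (not the code of `ASGNeuronTagAnalyzer` itself), the following function lists Trainium/Inferentia ASGs that lack the neuron tag. The function name and the `instance_types` key are assumptions made for illustration: in a real ASG the instance types come from the launch template or mixed instances policy, so this sketch expects them to have been resolved beforehand.

```python
# Required tag key, as documented for required_neuron_tag.
NEURON_TAG = (
    "k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron"
)
# Instance families named in the description above.
NEURON_FAMILIES = ("trn1", "inf1", "inf2")


def asgs_missing_neuron_tag(asgs):
    """Return names of Trainium/Inferentia ASGs that lack the neuron tag.

    Each ASG dict is assumed to carry a hypothetical "instance_types" list
    (e.g. ["trn1.32xlarge"]) alongside the usual "Tags" list.
    """
    missing = []
    for asg in asgs:
        families = {t.split(".")[0] for t in asg.get("instance_types", [])}
        if not families.intersection(NEURON_FAMILIES):
            continue  # not a Trainium/Inferentia group, tag not required
        tag_keys = {tag.get("Key") for tag in asg.get("Tags", [])}
        if NEURON_TAG not in tag_keys:
            missing.append(asg["AutoScalingGroupName"])
    return missing
```

A non-empty result corresponds to the failure case: at least one Trainium/Inferentia ASG is missing the tag.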
- domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_tags(domino_api_client, aws_region, skip_karpenter_enabled, k8s_client)
Return type: None
Description:
The cluster autoscaler depends on Auto Scaling Group (ASG) tags to discover and scale nodes. If any tags are missing, review the ASG tags, as this could be the cause of scaling problems. Not applicable to Karpenter-enabled deployments.
Required tags in AWS:
k8s.io/cluster-autoscaler/enabled
k8s.io/cluster-autoscaler/{cluster name}
Required tags in EKS (1.24+) if scaling up “from 0 nodes”:
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone = <AWS AZ’s>
k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node
k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool
Optional tags:
k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse
Note:
For EKS managed node groups, specific tag keys such as “k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node” are generated from the nodes’ labels via their kubelet configuration. This overrides the tag values set in the autoscaling group, so explicitly setting these tags in the autoscaling group is unnecessary.
Result:
The toolkit check displays all Auto Scaling Groups (ASGs), while also validating them for required tags.
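The tag validation above can be illustrated with a short sketch. It is an assumption-laden example rather than the toolkit's code: `missing_required_tags` and its `scale_from_zero` flag are hypothetical names, the ASG dictionary is assumed to follow the `describe_auto_scaling_groups` shape, and the special handling of EKS managed node groups described in the note is not modeled.

```python
# Tag keys required when scaling up from 0 nodes on EKS 1.24+ (from the list above).
SCALE_FROM_ZERO_TAGS = [
    "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
    "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node",
    "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool",
]


def missing_required_tags(asg, cluster_name, scale_from_zero=False):
    """Return the required tag keys that are absent from one ASG."""
    required = [
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    ]
    if scale_from_zero:
        required.extend(SCALE_FROM_ZERO_TAGS)
    present = {tag.get("Key") for tag in asg.get("Tags", [])}
    # Preserve the documented order so the report is easy to compare.
    return [key for key in required if key not in present]
```

An empty list means the ASG carries every required key; only tag keys are compared here, not their values.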
Public Facing KB:
- domino_admin_toolkit.checks.test_aws_cloud.test_unattached_ebs_volumes(ebs_volume_data, aws_region)
- Return type:
- Description:
Checks for available (unattached) EBS volumes on AWS that may represent wasted spend or indicate issues.
- Behavior:
INFO: Reports all unattached volumes ≥100GiB for visibility
WARNING: Highlights volumes older than 7 days (potential cleanup candidates)
FAILURE: Only fails if count exceeds 20 volumes (indicates systemic issue)
This approach reduces noise while still surfacing actionable cost optimization opportunities.
- Failure Conditions:
More than 20 unattached volumes ≥100GiB (indicates systemic issue)
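The INFO/WARNING/FAILURE behavior above can be sketched as follows. This is an illustration under stated assumptions, not the logic of `EBSVolumeAnalyzer`: the function name is hypothetical, volumes are assumed to be dictionaries shaped like EC2 `describe_volumes` results (`VolumeId`, `Size` in GiB, `State`, `CreateTime`), and volume age is assumed to be measured from `CreateTime`, which the description does not specify.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_GIB = 100             # only volumes >= 100 GiB are reported
WARN_AGE = timedelta(days=7)   # older volumes are cleanup candidates
FAIL_COUNT = 20                # fail only when the count exceeds this


def summarize_unattached(volumes, now=None):
    """Classify unattached volumes into info/warning lists and a failure flag."""
    now = now or datetime.now(timezone.utc)
    # "available" is the EC2 state of a volume that is not attached.
    reported = [
        v for v in volumes
        if v["State"] == "available" and v["Size"] >= MIN_SIZE_GIB
    ]
    # Assumption: age is computed from the volume's creation time.
    old = [v for v in reported if now - v["CreateTime"] > WARN_AGE]
    return {
        "info": [v["VolumeId"] for v in reported],
        "warning": [v["VolumeId"] for v in old],
        "failed": len(reported) > FAIL_COUNT,
    }
```

Keeping the thresholds as module-level constants makes it easy to see that a deployment with exactly 20 reported volumes still passes, while 21 fails.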
- Troubleshooting Steps:
Review volumes older than 7 days - these are prime cleanup candidates
Check if volumes have appropriate tags (backup, snapshot, etc.)
Verify volumes aren’t legitimate standby/backup storage
Use AWS Cost Explorer to quantify storage costs
- Resolution Steps:
Delete confirmed orphaned volumes via AWS Console or CLI
Add lifecycle policies for automated cleanup of old volumes
Improve volume tagging practices for better tracking
Set up CloudWatch billing alerts for EBS costs
- Required Permissions:
Platform admin access to AWS Console/CLI for volume management