domino_admin_toolkit.checks.test_aws_cloud module

pydantic model domino_admin_toolkit.checks.test_aws_cloud.ASGNeuronTagAnalyzer

Bases: AnalyzerBase

Validates that Auto Scaling Groups have the required neuron tag

Fields:
field required_neuron_tag: str = 'k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron'

The required neuron resource tag

analyze(data)

Core analysis logic

Return type:

list[CheckResult]

name: ClassVar[str] = 'ASGNeuronTagAnalyzer'
pydantic model domino_admin_toolkit.checks.test_aws_cloud.EBSVolumeAnalyzer

Bases: AnalyzerBase

Analyzes unattached EBS volumes for cost optimization opportunities

Fields:
field age_threshold_days: int = 7

Age threshold for warnings

field failure_threshold: int = 20

Volume count threshold for failure

analyze(data)

Analyze EBS volume data for cost optimization

Return type:

list[CheckResult]

name: ClassVar[str] = 'EBSVolumeAnalyzer'
domino_admin_toolkit.checks.test_aws_cloud.asg_data(aws_region)
Return type:

DataFrame

domino_admin_toolkit.checks.test_aws_cloud.ebs_volume_data(aws_region, k8s_client)

Collect EBS volume data

Return type:

DataFrame
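A minimal sketch of how unattached volumes could be collected, assuming an object that behaves like a boto3 EC2 client. The helper name collect_unattached_volumes and the row layout are hypothetical and are not taken from the toolkit, and the sketch ignores the k8s_client argument of the documented fixture. The rows can be passed to pandas.DataFrame to match the documented return type.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def collect_unattached_volumes(ec2_client: Any, now: Optional[datetime] = None) -> list:
    """Return one row per unattached EBS volume (hypothetical helper).

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    now = now or datetime.now(timezone.utc)
    rows = []
    paginator = ec2_client.get_paginator("describe_volumes")
    # "available" is the EC2 state of a volume that is not attached to an instance.
    pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
    for page in pages:
        for volume in page.get("Volumes", []):
            rows.append(
                {
                    "volume_id": volume["VolumeId"],
                    "size_gib": volume["Size"],
                    # CreateTime is a timezone-aware datetime in boto3 responses.
                    "age_days": (now - volume["CreateTime"]).days,
                }
            )
    return rows
```

Passing `now` explicitly keeps the age calculation deterministic, which makes the helper easy to exercise with a stub client.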

domino_admin_toolkit.checks.test_aws_cloud.get_asg_data(aws_region)

Retrieves Auto Scaling Group data from AWS

Return type:

DataFrame
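A minimal sketch of how ASG data could be retrieved, assuming an object that behaves like a boto3 Auto Scaling client. The helper name collect_asg_rows and the chosen columns are hypothetical and may differ from what get_asg_data actually returns.

```python
from typing import Any


def collect_asg_rows(autoscaling_client: Any) -> list:
    """Flatten describe_auto_scaling_groups pages into one row per ASG.

    `autoscaling_client` is assumed to behave like boto3.client("autoscaling").
    """
    rows = []
    paginator = autoscaling_client.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        for group in page.get("AutoScalingGroups", []):
            rows.append(
                {
                    "name": group["AutoScalingGroupName"],
                    "min_size": group.get("MinSize"),
                    "max_size": group.get("MaxSize"),
                    # Tags arrive as [{"Key": ..., "Value": ...}, ...]; a dict
                    # makes later "is this key present?" checks simpler.
                    "tags": {t["Key"]: t.get("Value", "") for t in group.get("Tags", [])},
                }
            )
    return rows
```

The resulting list of dicts can be passed to pandas.DataFrame to obtain the documented return type.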

domino_admin_toolkit.checks.test_aws_cloud.test_asg_hwt_matches(domino_api_client, k8s_client, aws_region, skip_karpenter_enabled)
Return type:

None

Description:

Checks that all hardware tiers have a corresponding autoscaling group with the right node-pool tag. Not applicable to Karpenter-enabled deployments.

Result:

The toolkit check displays all hardware tiers and their corresponding autoscaling groups.
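The matching described above could be sketched as follows. This is an illustration under assumptions, not the toolkit's implementation: the tag key is assumed to be the node-pool node-template label tag, and the dict layouts for hardware tiers and ASGs (id, node_pool, name, tags) are hypothetical.

```python
# Assumed tag key; the check only documents "the right node-pool tag".
NODE_POOL_TAG = "k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool"


def match_tiers_to_asgs(hardware_tiers, asgs):
    """Map each hardware tier id to the names of ASGs tagged with its node pool."""
    asgs_by_pool = {}
    for asg in asgs:
        pool = asg["tags"].get(NODE_POOL_TAG)
        if pool is not None:
            asgs_by_pool.setdefault(pool, []).append(asg["name"])
    # A tier that maps to an empty list has no corresponding ASG.
    return {tier["id"]: asgs_by_pool.get(tier["node_pool"], []) for tier in hardware_tiers}
```

A tier whose entry is an empty list is the case this check is meant to surface.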

domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_neuron_tags(asg_data, skip_karpenter_enabled)
Return type:

None

Description:

Checks that all Trainium and Inferentia Auto Scaling Groups (ASGs) have the required neuron resource tag for proper cluster autoscaling of ML workloads. Not applicable to Karpenter-enabled deployments.

Trainium (trn1) and Inferentia (inf1, inf2) instances require the neuron tag so that the cluster autoscaler knows these nodes provide AWS Neuron resources for ML training and inference workloads.

Required tag for Trainium/Inferentia ASGs:

  • k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron

Result:

The toolkit check displays all Trainium/Inferentia ASGs and validates they have the required neuron resource tag. Fails if any are missing the tag.
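A minimal sketch of the tag validation, assuming each ASG is represented as a dict with hypothetical name, instance_types, and tags keys. The instance-family match covers only the families named above (trn1, inf1, inf2); how the real analyzer identifies Trainium and Inferentia ASGs is not documented here.

```python
NEURON_TAG = "k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron"
NEURON_FAMILIES = ("trn1", "inf1", "inf2")


def asgs_missing_neuron_tag(asgs):
    """Return the names of Trainium/Inferentia ASGs that lack the neuron tag."""
    missing = []
    for asg in asgs:
        # "trn1.32xlarge" -> family "trn1"
        uses_neuron = any(
            itype.split(".")[0] in NEURON_FAMILIES for itype in asg["instance_types"]
        )
        if uses_neuron and NEURON_TAG not in asg["tags"]:
            missing.append(asg["name"])
    return missing


asgs = [
    {"name": "trn-pool", "instance_types": ["trn1.32xlarge"], "tags": {}},
    {"name": "inf-pool", "instance_types": ["inf2.xlarge"], "tags": {NEURON_TAG: "1"}},
    {"name": "cpu-pool", "instance_types": ["m5.2xlarge"], "tags": {}},
]
print(asgs_missing_neuron_tag(asgs))  # ['trn-pool']
```

A non-empty result corresponds to the failure case of this check.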

domino_admin_toolkit.checks.test_aws_cloud.test_asg_missing_tags(domino_api_client, aws_region, skip_karpenter_enabled, k8s_client)

Return type:

None

Description:

The cluster autoscaler depends on Auto Scaling Group (ASG) tags to discover and scale nodes. If any tags are missing, review the ASG tags, as this could be the cause of scaling problems. Not applicable to Karpenter-enabled deployments.

Required tags in AWS:

  • k8s.io/cluster-autoscaler/enabled

  • k8s.io/cluster-autoscaler/{cluster name}

Required tags in EKS (1.24+) if scaling up “from 0 nodes”:

  • k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone = <AWS AZs>

  • k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node

  • k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool

Optional tags:

  • k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse

Note:

For EKS managed nodegroups, specific tag keys such as “k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/domino-node” are generated from the nodes’ labels via their kubelet configuration. The generated values override the tag values set in the autoscaling group, so explicitly setting these tags in the autoscaling group is unnecessary.

Result:

The toolkit check displays all Auto Scaling Groups (ASGs) and validates that each has the required tags.
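The required-tag lists above can be turned into a simple lookup, sketched below. The function name and the scale_from_zero flag are hypothetical, and the sketch does not model the EKS managed nodegroup exception described in the note.

```python
def missing_required_tags(asg_tags, cluster_name, scale_from_zero=False):
    """Return the required tag keys that are absent from an ASG's tags.

    `asg_tags` is a dict of tag key -> value for a single ASG.
    """
    required = [
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    ]
    if scale_from_zero:
        # Extra tags listed for EKS (1.24+) when scaling up from 0 nodes.
        prefix = "k8s.io/cluster-autoscaler/node-template/label/"
        required += [
            prefix + "topology.ebs.csi.aws.com/zone",
            prefix + "dominodatalab.com/domino-node",
            prefix + "dominodatalab.com/node-pool",
        ]
    return [key for key in required if key not in asg_tags]
```

For example, an ASG tagged only with k8s.io/cluster-autoscaler/enabled in a cluster named "prod" would be reported as missing k8s.io/cluster-autoscaler/prod.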

Public Facing KB:

domino_admin_toolkit.checks.test_aws_cloud.test_unattached_ebs_volumes(ebs_volume_data, aws_region)
Return type:

None

Description:

Checks for available, unattached EBS volumes in AWS that may represent wasted spend or indicate underlying issues.

Behavior:
  • INFO: Reports all unattached volumes ≥100GiB for visibility

  • WARNING: Highlights volumes older than 7 days (potential cleanup candidates)

  • FAILURE: Only fails if count exceeds 20 volumes (indicates systemic issue)

This approach reduces noise while still surfacing actionable cost optimization opportunities.

Failure Conditions:
  • More than 20 unattached volumes ≥100GiB (indicates systemic issue)
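The tiered behavior above could be sketched as follows. The thresholds mirror the documented values (100 GiB, 7 days, 20 volumes); the function name, the row keys size_gib and age_days, the "PASS" status, and the precedence of FAILURE over WARNING are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VolumeReport:
    status: str     # "PASS", "INFO", "WARNING", or "FAILURE"
    reported: list  # unattached volumes at or above the size floor
    stale: list     # reported volumes older than the age threshold


def classify_volumes(volumes, min_size_gib=100, age_threshold_days=7, failure_threshold=20):
    """Classify unattached volumes according to the documented thresholds."""
    reported = [v for v in volumes if v["size_gib"] >= min_size_gib]
    stale = [v for v in reported if v["age_days"] > age_threshold_days]
    if len(reported) > failure_threshold:
        status = "FAILURE"   # more than 20 large volumes: systemic issue
    elif stale:
        status = "WARNING"   # some volumes are cleanup candidates
    elif reported:
        status = "INFO"      # visible, but nothing actionable yet
    else:
        status = "PASS"      # assumed status when nothing is reported
    return VolumeReport(status, reported, stale)
```

Keeping the thresholds as keyword arguments mirrors the age_threshold_days and failure_threshold fields on EBSVolumeAnalyzer, so the same logic can be exercised with different limits.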

Troubleshooting Steps:
  1. Review volumes older than 7 days - these are prime cleanup candidates

  2. Check if volumes have appropriate tags (backup, snapshot, etc.)

  3. Verify volumes aren’t legitimate standby/backup storage

  4. Use AWS Cost Explorer to quantify storage costs

Resolution Steps:
  1. Delete confirmed orphaned volumes via AWS Console or CLI

  2. Add lifecycle policies for automated cleanup of old volumes

  3. Improve volume tagging practices for better tracking

  4. Set up CloudWatch billing alerts for EBS costs

Required Permissions:

Platform admin access to AWS Console/CLI for volume management