domino_admin_toolkit.checks.test_hardware_tier_sizing module

class domino_admin_toolkit.checks.test_hardware_tier_sizing.HardwareTierRow

Bases: TypedDict

cores_limit: float | None
cores_requested: float | None
cpu_ratio: float | None
gpu_count: int | None
id: str
max_simultaneous_executions: int | None
memory_limit_gib: float | None
memory_ratio: float | None
memory_requested_gib: float | None
name: str
node_pool: str
overprovision_instances: int | None

pydantic model domino_admin_toolkit.checks.test_hardware_tier_sizing.HardwareTierSizingAnalyzer

Bases: AnalyzerBase[HardwareTierRow]

Flags hardware tiers whose configured CPU/memory limit-to-request ratio sets the latent conditions for node-pressure failures.

Fields:
field fail_ratio: float = 5.0

Flag a tier as FAIL when its CPU or memory limit divided by its request meets or exceeds this value. At 5x and above, only a few executions bursting toward their limits simultaneously can exhaust a node’s real capacity, triggering CPU throttling, out-of-memory kills, or node-pressure evictions for unrelated pods sharing the node.

field warn_ratio: float = 2.0

Flag a tier as WARN when its CPU or memory limit divided by its request meets or exceeds this value. At 2x and above, a single execution on this tier can consume substantially more CPU or memory than the cluster scheduler reserved for it — review whether that headroom is intentional for the workloads the tier hosts.

analyze(data)

Analyzes one row and returns a list of CheckResult instances.

Return type:

list[CheckResult]

Args:

data: One row dict (TRow). The Runner calls this once per DataFrame row.

Returns:

List[CheckResult]: A list containing the results of the analysis.

Raises:

NotImplementedError: If this method is not implemented by subclasses.

name: ClassVar[str] = 'HardwareTierSizingAnalyzer'

domino_admin_toolkit.checks.test_hardware_tier_sizing.hardware_tier_data(domino_api_client)

Hardware tier configuration projected from the v1 Domino REST API.

Skips when (a) the API returns 404, as on older deployments without the v1 API, or (b) the tier list is empty. Any other exception, including a ValueError raised for an unexpected response envelope, propagates as a real test failure; silent skips on misconfiguration cost a round-trip on PR #1371.
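
The skip-versus-fail policy described above can be sketched as follows. This is a minimal illustration, not the fixture's actual code: HttpStatusError, SkipCheck, load_tiers, and the "hardwareTiers" envelope key are all hypothetical names chosen for the example, and a real pytest fixture would call pytest.skip(reason) where the sketch raises SkipCheck.

```python
from typing import Any, Callable


class HttpStatusError(Exception):
    """Hypothetical stand-in for whatever error the real API client raises."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class SkipCheck(Exception):
    """Hypothetical stand-in for pytest.skip() so the sketch needs no pytest."""


def load_tiers(fetch: Callable[[], Any]) -> list[dict[str, Any]]:
    """Return tier rows, skipping only on a 404 or an empty tier list."""
    try:
        payload = fetch()
    except HttpStatusError as exc:
        if exc.status_code == 404:
            raise SkipCheck("v1 hardware tier API not available") from exc
        raise  # any other HTTP status is a real failure, not a skip

    # Assumed envelope shape: {"hardwareTiers": [...]}. Anything else is
    # treated as misconfiguration and raised rather than skipped.
    if not isinstance(payload, dict) or "hardwareTiers" not in payload:
        raise ValueError("unexpected response envelope")

    tiers = payload["hardwareTiers"]
    if not tiers:
        raise SkipCheck("no hardware tiers configured")
    return tiers
```

Keeping the skip conditions to exactly two named cases means that a malformed response or an authentication problem shows up as a failure instead of disappearing into a skipped test.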

Return type:

DataFrame

domino_admin_toolkit.checks.test_hardware_tier_sizing.test_hardware_tier_sizing(hardware_tier_data, runner)

Return type:

None

Description:

Audits configured Domino hardware tiers to answer: are these tiers right-sized for the workloads they host? Each tier defines a CPU and memory request (what the Kubernetes scheduler reserves on a node when placing a pod) and a CPU and memory limit (the ceiling the kernel enforces at runtime). When the limit is substantially larger than the request, a single execution on that tier can consume more resources than the cluster reserved for it. Modest headroom is normal and often intentional; large multipliers create the conditions for node-pressure failures — CPU throttling, out-of-memory kills, or evictions of unrelated pods — when multiple executions burst toward their limits on the same node at the same time.
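
To make the arithmetic behind this concrete, the sketch below counts how many executions running at their limit are enough to exceed a node's capacity. The node size and tier values are illustrative assumptions, pods_to_exhaust is a hypothetical helper, and the model ignores system reservations and the usage of non-bursting pods.

```python
import math


def pods_to_exhaust(node_capacity: float, limit: float) -> int:
    """Smallest number of pods that, each running at its limit, exceed node capacity."""
    return math.floor(node_capacity / limit) + 1


node_cores = 16.0      # assumed node size
request_cores = 1.0    # assumed tier CPU request

# The scheduler packs by request, so this many pods can land on the node.
pods_scheduled = int(node_cores // request_cores)  # 16

for ratio in (1.0, 2.0, 5.0):
    limit_cores = request_cores * ratio
    needed = pods_to_exhaust(node_cores, limit_cores)
    reachable = needed <= pods_scheduled
    print(f"{ratio:.0f}x: {needed} pods at limit exceed the node; reachable={reachable}")

# 1x: 17 pods needed, but only 16 fit, so the node cannot be overcommitted
# 2x: 9 of the 16 scheduled pods bursting together exceed the node
# 5x: 4 of the 16 scheduled pods bursting together exceed the node
```

The same calculation applies to memory in GiB; there the consequence of exceeding capacity is out-of-memory kills or evictions rather than throttling.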

This check lists every active tier with its CPU and memory limit-to-request ratios and flags those that meet or exceed the configured thresholds (default WARN at 2x, FAIL at 5x). Tier data is sourced from the versioned Domino REST API (GET /api/hardwaretiers/v1/hardwaretiers).

Note: this check is based on the request and limit values configured on each tier. It does not currently inspect the per-tier overcommit toggles available in the admin UI; those determine whether the limit can actually be enforced at runtime on this deployment.

Failure Conditions:
  • FAIL when a tier’s CPU or memory limit-to-request ratio meets or exceeds the FAIL threshold (default 5x). At this level, a small number of executions bursting simultaneously can exhaust node capacity and cause workloads to be throttled or evicted.

  • WARN when the ratio meets or exceeds the WARN threshold (default 2x) but is below FAIL. Headroom at this level may be intentional for genuinely bursty workloads; verify each flagged tier is configured that way deliberately.

  • ERROR when a tier is missing a CPU or memory request, or has a non-positive request value. A tier in this state cannot be analyzed; check the tier configuration in the admin UI. (A limit of 0 means “no separate limit configured” and is treated as equal to the request — no overcommit possible, PASS.)
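
The conditions above can be summarized in a short sketch. It is an illustration of the documented rules rather than the analyzer's source: limit_to_request_ratio and classify_tier are hypothetical helpers, and treating a missing (None) limit the same as a limit of 0 is an assumption made for the example.

```python
from typing import Optional


def limit_to_request_ratio(
    request: Optional[float], limit: Optional[float]
) -> Optional[float]:
    """Return limit / request, or None when the request is missing or non-positive."""
    if request is None or request <= 0:
        return None
    # A limit of 0 means "no separate limit configured" and counts as equal to
    # the request. Assumption: a missing (None) limit is handled the same way.
    if not limit:
        return 1.0
    return limit / request


def classify_tier(
    cpu_request: Optional[float],
    cpu_limit: Optional[float],
    mem_request: Optional[float],
    mem_limit: Optional[float],
    warn_ratio: float = 2.0,
    fail_ratio: float = 5.0,
) -> str:
    """Classify a tier by the worse of its CPU and memory limit-to-request ratios."""
    ratios = [
        limit_to_request_ratio(cpu_request, cpu_limit),
        limit_to_request_ratio(mem_request, mem_limit),
    ]
    if any(r is None for r in ratios):
        return "ERROR"  # the tier cannot be analyzed
    worst = max(ratios)
    if worst >= fail_ratio:
        return "FAIL"
    if worst >= warn_ratio:
        return "WARN"
    return "PASS"


print(classify_tier(1.0, 5.0, 4.0, 4.0))    # FAIL: CPU ratio is 5x
print(classify_tier(1.0, 2.0, 4.0, 6.0))    # WARN: CPU ratio is exactly 2x
print(classify_tier(1.0, 0.0, 4.0, 0.0))    # PASS: limits of 0 equal the request
print(classify_tier(None, 2.0, 4.0, 4.0))   # ERROR: CPU request is missing
```

Both thresholds are inclusive ("meets or exceeds"), so a tier at exactly 2x is WARN and a tier at exactly 5x is FAIL.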

Troubleshooting Steps:
  1. For each tier flagged WARN or FAIL, open it in the Domino UI (Admin -> Hardware Tiers -> [tier name]) and review the CPU “Requested” / “Limit” and Memory “Requested” / “Limit” fields against the workloads typically run on this tier.

  2. Ask: do the workloads on this tier genuinely need to burst above their request? Common reasons to permit overcommit include batch jobs with brief peak demand, mixed-density workloads, or short-lived interactive sessions. Common reasons NOT to permit it include long-running production model APIs, scheduled jobs with predictable steady-state usage, or any tier hosting shared platform services.

  3. Confirm the tier’s Node Pool maps to a real cluster node group (cross-check with the test_aws_cloud.py::test_asg_hwt_matches result if available in this bundle).

  4. If the tier is shared with platform components, check test_k8s_platform_sizing.py for actual usage trends in the relevant namespaces.

Resolution Steps:
  1. For tiers where overcommit was unintentional: bring the CPU and memory Limit values closer to the Requested values via the Domino UI (Admin -> Hardware Tiers -> edit). Setting Limit equal to Request eliminates overcommit on that tier entirely.

  2. For tiers where overcommit is deliberate and well-understood: document the rationale (e.g. in a runbook the on-call team can reference) and either accept the WARN as an informational signal or, if appropriate for this deployment, raise the analyzer’s warn_ratio threshold via the toolkit’s analyzer configuration.

Required Permissions:
  • Platform admin access to the Domino UI (Admin -> Hardware Tiers).

  • Read access to the versioned hardware tier REST API (granted automatically to in-cluster toolkit pods via OAuth S2S).

See also:
  • test_karpenter_capacity.py - if a flagged tier maps to a saturated pool, runtime symptoms surface here.

  • test_k8s_platform_sizing.py - actual usage vs requests for the domino-platform-* namespaces (counterpart when the tier hosts platform components).

  • test_aws_cloud.py::test_asg_hwt_matches - confirms each tier’s nodePool maps to a real autoscaling group.