GPU Usage Optimization: The Hidden Cost of Poor Job Scheduling
Adoption of artificial intelligence (AI) across businesses is moving faster than ever. To gain a competitive advantage, businesses are increasingly experimenting with custom machine learning (ML) models to deliver improved user experiences and streamlined operations. However, training, fine-tuning, reinforcement learning, and model evaluation all require access to expensive graphics processing units (GPUs) for accelerated compute.
Leaders often hear from their teams that they lack sufficient GPUs to make progress on their AI/ML goals, or that they need more compute to accelerate development. The truth? Most organizations statically allocate GPU resources and lack central visibility into actual usage across teams, clusters, cloud providers, and even on-prem deployments. This approach creates blind spots and inefficiencies, leaving pockets of idle capacity that are paid for but never used.
If your company reserves GPU capacity for AI/ML initiatives, you may be letting millions of dollars of investment go to waste each year.
In this blog post, we explore the reality behind GPU usage, the common root causes, and the solutions that let leaders of AI/ML initiatives achieve 2-3x job scheduling velocity, accelerate their research cycles, and deliver on their goals faster.
Most GPUs Are Underutilized
Based on first-hand experience managing GPU infrastructure at one of the world's largest technology companies, we found that not only were GPUs sitting idle across the company, but leaders struggled to quantify the cost of those idle resources in real dollars left on the table. As we talk to more and more businesses across industries, we hear that this challenge is not unique, but rather the norm. Businesses, on average, achieve only 40-60% usage of their GPUs.
The numbers tell the story:
- Recent academic research confirms multi-tenant clusters average ~50% utilization [1]
- Over 75% of companies report peak utilization below 70%, and only 7% exceed 85% [2]
Breaking Down the Actual Cost
Let’s dive a bit deeper into what the cost of underutilized GPUs actually looks like.
- A single H100 GPU can cost $2-8/hour from cloud providers, or roughly $17-70K per GPU per year.
- 100 GPUs at 60% utilization = up to $2.8M wasted annually on idle compute (see the quick calculation below)
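The waste estimate above is simple arithmetic. Here is a minimal sketch of the calculation, assuming 24/7 reservations at the top of the quoted $2-8/hour range; plug in your own fleet size, rate, and measured utilization:

```python
# Back-of-the-envelope idle-GPU cost, using the figures cited above.
# Assumptions: 100 reserved GPUs, $8/hour, 60% utilization, billed 24/7.
HOURS_PER_YEAR = 24 * 365

def idle_cost(num_gpus: int, hourly_rate: float, utilization: float) -> float:
    """Dollars per year spent on reserved GPU hours that go unused."""
    idle_fraction = 1.0 - utilization
    return num_gpus * hourly_rate * HOURS_PER_YEAR * idle_fraction

print(f"${idle_cost(100, 8.00, 0.60):,.0f} wasted per year")  # ~ $2,803,200
```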
McKinsey projects that by 2030, data centers (the providers of GPU compute capacity) will have invested $5.2-7.9T in AI infrastructure development [3]. That is a clear sign that demand is not slowing down and that companies will keep purchasing more and more GPUs. But if organizations fail to achieve high utilization of their existing resources, why are they continuing to buy more?
The Root Cause: Siloed Allocation
Most businesses we talk to allocate GPU resources across teams manually. This process requires significant effort to decide who gets what, and often relies on rudimentary tools like spreadsheets to track allocations. Moreover, this approach silos capacity into blocks of compute that other teams cannot use when the GPUs sit idle.
Let's take a look at the typical static allocation pattern, without intelligent job scheduling or idle capacity sharing:
- Team A gets a dedicated GPU slice
- Team B gets their own slice
- Team A maxes out its allocation, with more jobs waiting in the queue, while Team B's resources sit idle and unused
- The next day, roles reverse: Team A has no demand for GPUs, while Team B has more jobs ready to run than the capacity statically allocated to it
- Neither can access the other's unused capacity
- Both teams request more GPUs because jobs take longer to start whenever their demand spikes (a toy model of this pattern is sketched below)
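To make the cost of this pattern concrete, here is a small, hypothetical simulation contrasting the two siloed teams above with a single shared pool. The demand numbers are illustrative, not measured data:

```python
# Toy model: two teams, 50 GPUs each (100 pooled), with alternating demand spikes.
# Day 1: Team A wants 80 GPUs, Team B wants 10. Day 2: roles reverse.
demand = [{"A": 80, "B": 10}, {"A": 10, "B": 80}]

def gpus_used_static(day):
    # Each team is capped at its own 50-GPU slice.
    return min(day["A"], 50) + min(day["B"], 50)

def gpus_used_pooled(day):
    # Idle capacity is shared, so total usage is capped only by the pool size.
    return min(day["A"] + day["B"], 100)

for i, day in enumerate(demand, 1):
    print(f"Day {i}: static={gpus_used_static(day)} GPUs busy, "
          f"pooled={gpus_used_pooled(day)} GPUs busy")
# Static allocation leaves 30 GPUs idle each day even though jobs are queued.
```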
Understanding GPU Utilization Challenges & the Hidden Costs
It’s not just that your GPU spend is being wasted. This pattern of siloed allocations has additional hidden costs that compound the impact on throughput and velocity.
- Resource hoarding: Teams hold capacity "just in case" and don’t have predictable workload schedules, making it difficult to claw back capacity.
- Slower experimentation: Rigid allocation creates longer job queues. Longer wait times mean fewer experiments run, and ultimately slower progress.
- Large multi-node job starvation: Static scheduling forces teams to choose between consuming their GPUs on small experiments or letting them sit idle while they wait for enough jobs to finish before starting large-scale distributed training jobs.
Curious how your company compares? Contact us for a free assessment.
GPU Utilization Best Practices for AI Teams
High-performing organizations that place an emphasis on GPU efficiency achieve daily cluster utilization of 85-95%, resulting in tangible, real-world business outcomes. Put another way, teams that implement GPU efficiency strategies see a 30-40% reduction in GPU costs while accelerating experimentation by 2-3x. We saw this transformation first-hand at one of the largest Fortune 100 companies.
What do these organizations focus on to achieve high utilization?
- Centralized usage visibility for both executives and managers across every cluster with reserved GPU capacity, regardless of where it's hosted (cloud service providers, on-prem data centers, etc.); a minimal collection sketch follows this list
- Actionable insights from AI tools that help teams understand how to improve efficiency through better allocation strategies, optimized job configuration recommendations, and more
- A mechanism to share idle capacity across teams while retaining SLA guarantees that every team can always access the resources it has been allocated
- Autonomous handling of GPUs that go bad or hit hardware errors, so engineers spend less time managing infrastructure and more time supporting applied scientists
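Centralized visibility starts with collecting per-GPU utilization from every host and shipping it to one place. As a minimal sketch, the snippet below samples utilization on a single node through NVIDIA's NVML bindings (this assumes the pynvml package is installed); a real deployment would run an agent like this on every node and aggregate the samples in a central metrics store:

```python
# Minimal per-node GPU utilization snapshot using NVML (pip install pynvml).
# In practice, export these samples to a central metrics system instead of printing.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % over the last sample window
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: compute={util.gpu}% "
              f"memory={mem.used / mem.total:.0%} of {mem.total >> 30} GiB")
finally:
    pynvml.nvmlShutdown()
```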
Key Metrics to Track
| Metric | What It Measures | Target |
|---|---|---|
| GPU usage % | The percentage of time that reserved GPUs were actually in use. | Above 80% |
| Average Idle GPUs | The number of GPUs that went unused. This helps identify capacity that could be scaled down or reallocated. | Less than 10-20% of total reserved capacity |
| Demand vs Supply | The number of GPUs requested by queued and running jobs vs. the supply of GPUs available. This is a proxy for whether your scheduling systems are working efficiently, lets you proactively spot when usage may drop, and shows whether you need more GPUs based on actual demand. | Demand > Supply |
| Wait Time | The average time it takes for jobs to start. The longer the wait, the longer scientists wait to see the results of their experiments. | Under 30 min |
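As an illustration of how these metrics can be derived from data you likely already have, here is a rough sketch that computes them from scheduler records. The record layout and numbers are hypothetical and would need to be adapted to your scheduler's actual logs:

```python
# Hypothetical scheduler/monitoring records; adapt the field names to your own logs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobRecord:
    gpus_requested: int
    submitted_at: float            # epoch seconds
    started_at: Optional[float]    # None while the job is still queued

TOTAL_GPUS = 100                                      # reserved fleet size
busy_gpu_hours, window_gpu_hours = 5_200.0, 8_000.0   # e.g. 100 GPUs over an 80-hour window

jobs = [JobRecord(8, 0.0, 1_800.0), JobRecord(64, 100.0, None)]  # toy data

usage_pct = 100 * busy_gpu_hours / window_gpu_hours              # GPU usage %
avg_idle_gpus = TOTAL_GPUS * (1 - busy_gpu_hours / window_gpu_hours)
demand = sum(j.gpus_requested for j in jobs)                     # queued + running
waits = [(j.started_at - j.submitted_at) / 60 for j in jobs if j.started_at is not None]
avg_wait_min = sum(waits) / len(waits)

print(f"usage={usage_pct:.0f}%  idle~{avg_idle_gpus:.0f} GPUs  "
      f"demand={demand} vs supply={TOTAL_GPUS}  avg wait={avg_wait_min:.0f} min")
```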
The Centralized Allocation & Scheduling Model
Teams often overestimate their true GPU needs. Instead of buying new GPUs each time a new AI initiative arises, leading organizations dynamically shift existing capacity to where it's needed most. The net result is higher job throughput, reduced job wait times, increased usage, and fewer human hours spent sitting in the middle manually allocating resources.
In this model, teams that have been allocated resources get a guaranteed SLA that they can always access their allocated GPUs. Without that guarantee, teams are reluctant to opt in to sharing their resources, because they want first priority once their own demand spikes.
For example, scientists and machine learning engineers can submit jobs that either run strictly within their allocated resources or burst beyond their allocation to run more work:
- Reserved: Guaranteed capacity for critical workloads
- Elastic: Preemptible access to idle resources for experiments
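For illustration, here is a hypothetical way these two job classes might look to a scientist submitting work; the names and API are invented for this sketch rather than taken from any specific scheduler:

```python
# Hypothetical job classes for a shared GPU pool; names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class JobClass(Enum):
    RESERVED = "reserved"   # guaranteed capacity, never preempted
    ELASTIC = "elastic"     # runs on idle capacity, may be preempted

@dataclass
class GpuJob:
    name: str
    gpus: int
    job_class: JobClass

# A production fine-tune gets guaranteed GPUs; a sweep soaks up idle capacity.
jobs = [
    GpuJob("nightly-finetune", gpus=16, job_class=JobClass.RESERVED),
    GpuJob("hparam-sweep-42", gpus=64, job_class=JobClass.ELASTIC),
]
```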
Why Centralized Dynamic Scheduling Works
Effective scheduler and allocation policies address four key requirements: (1) fairness, (2) fragmentation reduction, (3) throughput, and (4) predictability.
Static policies optimize for one; dynamic scheduling addresses all four.
Key Advantages:
- Dynamic capacity pools: All GPUs and their usage, across any GPU provider, automatically become visible to the scheduler
- Topology-aware scheduling & seamless job preemption: Workloads are dynamically scheduled based on priority and intelligently placed in the pools of compute with the lowest latency for distributed training. Elastic workloads are gracefully preempted when needed.
- Algorithmic allocation: Resources are shifted in real time based on budgets, policies, and actual usage to ensure teams have guaranteed capacity at the right time, with no manual action or decisions required (a simplified decision sketch follows)
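To show how these pieces fit together, here is a heavily simplified, hypothetical pool-level scheduling decision: elastic jobs backfill whatever is idle and are preempted whenever reserved demand needs the capacity back. A real scheduler would also handle topology, gang scheduling, and checkpointing, which are omitted here:

```python
# Simplified pool-level decision: reserved demand always wins; elastic jobs
# backfill idle GPUs and are preempted first when the pool gets tight.
def schedule(pool_size: int, reserved_demand: int, elastic_running: int,
             elastic_queued: int) -> dict:
    free = pool_size - reserved_demand - elastic_running
    if free < 0:
        # Reserved jobs need GPUs back: preempt just enough elastic work.
        return {"preempt_elastic_gpus": -free, "start_elastic_gpus": 0}
    # Otherwise, backfill idle GPUs with queued elastic work.
    return {"preempt_elastic_gpus": 0,
            "start_elastic_gpus": min(free, elastic_queued)}

print(schedule(pool_size=100, reserved_demand=70, elastic_running=20, elastic_queued=40))
# {'preempt_elastic_gpus': 0, 'start_elastic_gpus': 10}
print(schedule(pool_size=100, reserved_demand=95, elastic_running=20, elastic_queued=0))
# {'preempt_elastic_gpus': 15, 'start_elastic_gpus': 0}
```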
The ROI
The math:
- 100 GPUs at $4/hour = $3.5M/year
- 60% → 85% utilization = $875K recovered annually
- Plus: faster iteration, more experiments, less platform burden
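The same arithmetic as a sketch you can adapt, assuming 24/7 reservations at the quoted $4/hour rate:

```python
# Value recovered by lifting utilization from 60% to 85% on a 100-GPU fleet.
HOURS_PER_YEAR = 24 * 365
annual_spend = 100 * 4.00 * HOURS_PER_YEAR      # ~ $3.5M per year
recovered = annual_spend * (0.85 - 0.60)        # utilization lift
print(f"${recovered:,.0f} of compute recovered per year")  # ~ $876,000
```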
Getting Started
- Measure: Deploy monitoring. What's your actual utilization? Queue times? Idle capacity by team?
- Classify workloads: Reserved (production, SLA) vs. Elastic (experiments, dev)
- Centralize scheduling: Dynamic allocation with topology-aware placement and intelligent preemption
- Iterate: With data, answer: Do we need more GPUs, or better utilization?
The Bottom Line
When organizations adopt custom AI solutions to deliver on critical business objectives, GPU infrastructure becomes one of the largest technology investments they make.
The winners of the AI revolution aren't necessarily the ones who buy the most GPU compute. The winners will be the companies that extract the most value from every GPU hour they invest in and excel at AI training efficiency. The tools exist. The question is whether you'll adopt them to unlock more innovation, reduce GPU costs, and improve your machine learning infrastructure.
Want to see how your GPU utilization compares? Contact us for a free infrastructure assessment.
- [1] Mamirov, 2025
- [2] ClearML, 2024
- [3] McKinsey