Achieving 97% GPU Utilization on Google Cloud with Sycomp Intelligent Data Storage Platform

By: Scott Fadden, Lead HPC Architect | 09/30/2025

Introduction

As an HPC architect, I know that GPUs are a valuable resource: highly specialized, immensely powerful, and capable of extraordinary feats. I also know that GPUs don’t exist in a bubble; they’re only as effective as the systems that support them. If your GPUs can’t access data fast enough, they sit idle, wasting time, compute, and money.

That was the challenge presented by a global AI R&D leader. They asked us to prove that, in the cloud, we could keep their GPUs fully supplied with data under the most demanding conditions.

The Challenge

This wasn’t a theoretical exercise. The customer required validation under MLPerf Storage v1.0, the industry-standard benchmark for AI/ML storage performance. Their criteria included:

  • Sustained ≥24 GiB/s throughput per node
  • A “train_au_meet_expectation: success” result using the Unet3D dataset
  • High IOPS and sequential throughput for both training and inference
  • Future scalability across multi-node training clusters
  • Operation on Google Cloud, despite the absence of native GPUDirect support
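The “train_au_meet_expectation” criterion above is a pass/fail check on accelerator utilization (AU): the fraction of wall-clock time the benchmark’s emulated GPUs spend computing rather than waiting on storage. The sketch below illustrates the idea; the function names and the 90% threshold are illustrative assumptions, not the MLPerf Storage implementation, so consult the benchmark rules for the authoritative definition.

```python
# Illustrative sketch of an accelerator-utilization (AU) check in the
# spirit of MLPerf Storage. Names and the 90% threshold are assumptions.

def accelerator_utilization(compute_time_s: float, total_time_s: float) -> float:
    """Fraction of total benchmark wall-clock time spent in (emulated) GPU compute."""
    return compute_time_s / total_time_s

def au_meets_expectation(compute_time_s: float, total_time_s: float,
                         threshold: float = 0.90) -> bool:
    """Pass/fail in the spirit of the 'train_au_meet_expectation' result."""
    return accelerator_utilization(compute_time_s, total_time_s) >= threshold

# Example: GPUs busy 97% of wall-clock time clears an assumed 90% bar.
print(au_meets_expectation(compute_time_s=970.0, total_time_s=1000.0))  # True
```

The point of the metric is that raw throughput alone is not enough; storage must deliver data with low enough latency that the accelerators rarely stall.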

In short, we needed to deliver supercomputer-class performance in a public cloud environment.

The Sycomp Solution

To meet this challenge, we deployed the Sycomp Intelligent Data Storage Platform on Google Cloud, powered by IBM Storage Scale. The architecture included:

  • Compute: A single A3 Ultra instance with H100-class GPUs
  • Storage: Five-node C3 cluster with Hyperdisk Balanced volumes
  • Software orchestration: IBM Storage Scale

This configuration was designed to ensure GPUs could operate at full potential without waiting on data delivery.
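As a rough intuition for why a five-node storage cluster fits the 24 GiB/s target, here is a back-of-the-envelope sizing sketch. This is arithmetic only, not Sycomp’s actual design math: it assumes IBM Storage Scale stripes file data evenly across all storage nodes, so each node carries an equal share of the load.

```python
# Back-of-the-envelope sizing sketch (assumes even striping across the
# storage cluster; not the actual Sycomp design calculation).

GIB = 2**30

target_per_compute_node = 24 * GIB   # customer criterion: >= 24 GiB/s sustained
storage_nodes = 5                    # five-node C3 cluster from the article

# With data striped across all storage nodes, each node must sustain
# roughly an equal share of the per-compute-node target.
per_storage_node = target_per_compute_node / storage_nodes
print(f"{per_storage_node / GIB:.1f} GiB/s per storage node")  # 4.8 GiB/s
```

Parallel striping is what lets several moderately fast storage nodes present a single high-throughput data path to the compute node.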

Benchmarking Results

The MLPerf Storage v1.0 benchmark was executed on a single A3 Ultra compute node. The system achieved:

  • 23 GiB/s sustained throughput
  • “train_au_meet_expectation: success” with the Unet3D dataset
  • 157,000 IOPS on 4KiB random reads
  • 28.5 GiB/s on 64KiB sequential reads
  • 97% GPU utilization—proving that storage was no longer the bottleneck
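The small-block and large-block figures above describe two different access patterns, and simple unit conversion shows how they relate. The sketch below cross-checks the quoted numbers; it performs arithmetic only, using the block sizes stated in the results.

```python
# Cross-checking the reported benchmark figures (arithmetic only;
# block sizes are those quoted in the results above).

KIB, GIB = 2**10, 2**30

# 157,000 IOPS at 4 KiB random reads -> equivalent small-block bandwidth.
random_read_bw = 157_000 * 4 * KIB / GIB
print(f"4 KiB random reads: {random_read_bw:.2f} GiB/s")      # ~0.60 GiB/s

# 28.5 GiB/s at 64 KiB sequential reads -> implied request rate.
seq_read_iops = 28.5 * GIB / (64 * KIB)
print(f"64 KiB sequential reads: {seq_read_iops:,.0f} IOPS")  # 466,944 IOPS
```

The contrast makes the design point concrete: random small-block reads are limited by request rate, while sequential large-block reads are limited by raw bandwidth, and an AI training storage system has to do well at both.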

Why It Matters

For the customer, this wasn’t just about hitting benchmark numbers—it was about confidence. Confidence that their infrastructure could support ambitious AI workloads without compromise. With MLPerf validation, they can move forward with training pipelines in Google Cloud, assured of performance and scalability.

For HPC architects, the takeaway was clear: cloud environments can achieve HPC-class performance when designed correctly. Storage and data movement are just as critical as GPUs themselves. By engineering a balanced system, we enabled researchers to focus on innovation—not infrastructure.

Lessons Learned

  • Data pipelines are critical: GPUs are only as fast as the data delivered to them.
  • Cloud is ready for HPC: With the right design, performance and scalability are achievable.
  • Benchmarking validates outcomes: MLPerf provided a transparent, trusted framework to prove real-world capability.

At the end of the day, the goal wasn’t simply to achieve 97% GPU utilization—it was to empower a world-class research team to trust their infrastructure and accelerate their discoveries. When GPUs stay fully supplied, innovation happens at full speed.

Learn More

To explore the full benchmarking methodology, architecture, and performance validation process, read Sycomp's technical brief on MLPerf Storage benchmarking.

It provides in-depth insights into how the Sycomp Intelligent Data Storage Platform delivers HPC-class performance in the cloud.