It's been a minute.
Two years since my last post. In internet time, that's practically a geological era. In human time, it's been one of the most formative stretches of my life. I owe you an update, and I owe it with the same honesty I promised when I started this blog.
So here goes.
Scaling GPUs and Humility
I work on Amazon SageMaker HyperPod, a service that helps customers build and manage large-scale GPU clusters for training and running foundation models. If you've been following the AI space, you've seen the headlines: xAI announcing 100,000-GPU clusters, Meta unveiling massive training infrastructure. These are incredible feats of engineering. What you don't often see in those announcements are the failed attempts. The nodes that silently degrade. The GPU that reports healthy but corrupts every gradient it touches. The network interface that drops packets just infrequently enough to evade a simple health check.
In 2024, I helped scale HyperPod to its largest GPU cluster at the time. My work focused on something decidedly unglamorous but deeply important: detecting faulty nodes during the scaling of a cluster and replacing them with healthy ones before customers ever noticed. This meant pulling fault signals from NVIDIA SMI, DCGM, health monitoring agents, and network telemetry — building an understanding of what "unhealthy" truly looks like in a system with thousands of GPUs — and then orchestrating replacements so that the cluster could continue growing without stopping.
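To give a feel for the shape of that detection work, here's a minimal sketch. Everything in it — the field names, the thresholds, the specific checks — is illustrative, not HyperPod internals; the real system pulls far richer telemetry. The one idea it does capture faithfully is that no single signal source is trusted alone:

```python
from dataclasses import dataclass

@dataclass
class NodeSignals:
    """Fault signals gathered for one node. Field names are illustrative."""
    smi_xid_errors: int        # XID error count parsed from nvidia-smi
    dcgm_diag_passed: bool     # result of a DCGM diagnostic run
    agent_healthy: bool        # verdict from a host health-monitoring agent
    nic_packet_loss: float     # fraction of dropped packets on the fabric NIC

def is_unhealthy(s: NodeSignals, loss_threshold: float = 0.001) -> bool:
    """Flag a node if ANY independent signal trips.

    ORing the signals matters: a GPU can pass one check while failing
    another, so each source gets a veto over the node's health.
    """
    return (
        s.smi_xid_errors > 0
        or not s.dcgm_diag_passed
        or not s.agent_healthy
        or s.nic_packet_loss > loss_threshold
    )

# The NIC that drops packets "just infrequently enough" still gets caught,
# because the loss threshold is set below what a coarse ping check notices:
flaky = NodeSignals(0, True, True, nic_packet_loss=0.002)
print(is_unhealthy(flaky))  # True
```

In the real system, a node flagged this way gets drained and swapped for a healthy spare before the cluster's growth plan ever depends on it.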
Novel scaling mechanisms have since emerged. But at that moment, this was what allowed us to reach a milestone. I'm proud of that work, not because it was flashy, but because it was careful. In infrastructure, careful is the highest compliment.
Giving Customers Visibility
One of the things I've come to believe deeply is that infrastructure should not be a black box. If your node is unhealthy, you deserve to know — and you deserve to know why.
That belief became the EventBridge integration for HyperPod. We built a system that collects faults from NVIDIA SMI, DCGM, our health monitoring agent, and network-down events, and delivers them to customers as near-real-time notifications through Amazon EventBridge. Cluster status transitions, node health changes, automatic replacements during recovery — all of it surfaced as events that customers can act on. You can write simple EventBridge rules to trigger automated responses: page your on-call, spin up a replacement workflow, log it to your monitoring system.
It is the kind of feature that doesn't make headlines but fundamentally changes the relationship between a customer and their infrastructure. Instead of wondering "why did my training job fail at hour 47?", you get a timeline of exactly what happened, when it happened, and what the system did about it.
Persistent Storage for ML Workloads
Training a large model can take days, sometimes weeks. If your storage disappears when a pod restarts or a node gets replaced, you lose checkpoints, datasets, and artifacts that took hours to produce. This is the problem we solved with EBS CSI driver support for HyperPod.
My work here was on the foundational technology. I partnered closely with teams at EBS to enable cross-account volume attachment and support serverless volume attachment for managed services like HyperPod. These are capabilities that didn't exist before — the ability for a managed service to dynamically provision, attach, and manage EBS volumes across account boundaries, all through standard Kubernetes persistent volume claims. I also worked on the EC2 Volumes console page to bring this experience to the UI.
For customers, this means training workloads get persistent storage for datasets, model checkpoints, and shared artifacts that survives pod restarts and node replacements. For inference workloads, it means provisioning model storage, caching volumes, and maintaining event logs — all dynamically, all through Kubernetes-native workflows. You can resize volumes without service disruption. You can snapshot for backup and recovery. You can encrypt with your own KMS keys.
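"Kubernetes-native" here means the workload just declares a PersistentVolumeClaim and the EBS CSI driver does the provisioning and attachment behind the scenes. A rough sketch of such a claim, built as a Python dict and printed as JSON (which is valid YAML) — the storage class name `ebs-sc` is a hypothetical one that would point at the `ebs.csi.aws.com` provisioner:

```python
import json

# A PersistentVolumeClaim as a training pod might request it.
# "ebs-sc" is a hypothetical StorageClass backed by the EBS CSI driver.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "checkpoint-store"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],  # an EBS volume attaches to one node
        "storageClassName": "ebs-sc",
        "resources": {"requests": {"storage": "500Gi"}},
    },
}

# JSON is valid YAML, so this prints a manifest you could feed to
# `kubectl apply -f -`. The claim's lifecycle is independent of any pod
# that mounts it, which is what keeps checkpoints alive across pod
# restarts and node replacements.
print(json.dumps(pvc, indent=2))
```

Resizing is then just patching `spec.resources.requests.storage` upward, assuming the storage class allows volume expansion; the CSI driver handles the underlying EBS volume modification.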
If you're curious about the technical details, the AWS blog post I co-authored covers the full picture, including customer managed key support and the architecture behind it.
A Promotion and Gratitude
In the middle of all of this, I got promoted. L5 to L6 SDE at HyperPod.
I want to take a moment here, because promotions are rarely the work of one person. Thank you to Roshani Nagmote, Pavan Kumar Sundar, and Caesar Chen for helping me grow, for finding value in my work when I sometimes couldn't see it myself, and for creating the conditions where good engineering is recognized. And thank you to the entire HyperPod Data Plane team for giving me a space to grow — a team that trusted me with hard problems, celebrated the wins, and made the long debugging sessions feel a little less lonely.
I don't take this lightly. If you've ever worked in a large organization, you know that a promotion is not just about the technical work. It's about trust. It's about your peers and managers saying: we believe you can operate at a higher altitude. I am grateful for that trust, and I'm eager for the next step of this journey.
What's Next
I started this blog because I felt restricted by traditional social media. Two years later, that feeling hasn't changed — if anything, it's intensified. The world has gotten louder, the AI discourse more breathless, the LinkedIn posts more performative. I still want a space to write freely, to be technical when the subject demands it and human when the moment calls for it.
So I'll be writing more. About infrastructure, about the things that break at scale, about the quiet engineering that holds the loud demos together. And about the rest of life too — the parts that don't fit neatly into a "What's New" post.
It's good to be back.
With warmth,
Aditya
"We are what we repeatedly do. Excellence, then, is not an act, but a habit."
— Will Durant, paraphrasing Aristotle