VAST Data has introduced a redesigned AI inference architecture, developed in collaboration with NVIDIA, to address the demands of long-lived, agentic AI workloads, where performance is increasingly driven by how efficiently inference context is stored and shared rather than by raw GPU compute. Built on NVIDIA BlueField-4 DPUs and Spectrum-X networking, the new approach runs the VAST AI Operating System natively on the DPUs, collapsing traditional storage tiers and enabling a shared, pod-scale key-value (KV) cache with deterministic access. This design removes client-server bottlenecks, reduces data copies, and lowers time-to-first-token as concurrency increases, while VAST’s Disaggregated Shared-Everything architecture provides a globally coherent context namespace across nodes.
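To make the shared KV-cache idea concrete: in agentic pipelines, many requests share long token prefixes (system prompts, tool definitions, earlier turns), and a pod-wide cache lets any node reuse attention state another node already computed instead of re-running prefill. Below is a minimal, hypothetical Python sketch of a content-addressed cache with longest-prefix lookup; the `SharedKVCache` class and its methods are illustrative assumptions, not VAST or NVIDIA APIs, and in a real system the KV state would be GPU tensors moved over the DPU data path rather than Python objects.

```python
import hashlib
from typing import Optional

class SharedKVCache:
    """Toy model of a pod-scale, content-addressed KV cache.

    Identical token prefixes hash to the same key, so any node in
    the pod can reuse attention state another node already computed
    instead of re-running prefill. Purely illustrative; no VAST or
    NVIDIA API is being modeled here.
    """

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def _key(tokens: list[int]) -> str:
        # Content-addressed key: identical prefixes hash identically,
        # so agents sharing context hit the same cache entry.
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, tokens: list[int], kv_state: object) -> None:
        self._store[self._key(tokens)] = kv_state

    def longest_prefix(self, tokens: list[int]) -> tuple[int, Optional[object]]:
        """Return (matched_length, kv_state) for the longest cached
        prefix of `tokens`, or (0, None) if nothing matches."""
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(self._key(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, None


# Usage: the second request reuses the first request's prefill work,
# so only the new suffix tokens need compute (lower time-to-first-token).
cache = SharedKVCache()
system_prompt = list(range(1, 1001))          # stand-in token IDs
cache.put(system_prompt, "kv-tensors-for-1000-tokens")

request = system_prompt + [1001, 1002, 1003]  # shared prefix + new turn
matched, kv = cache.longest_prefix(request)
print(f"reused {matched} tokens; prefill needed for only {len(request) - matched}")
```

In production inference stacks the prefix is typically hashed in fixed-size blocks (paged caching) so lookup cost stays constant per block rather than scanning prefixes linearly, but the effect is the same: the larger the shared, cached prefix, the less prefill compute stands between a request and its first token.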
“Inference is becoming a memory system, not a compute job. The winners won’t be the clusters with the most raw compute – they’ll be the ones that can move, share, and govern context at line rate,” said John Mao, Vice President, Global Technology Alliances at VAST Data. Kevin Deierling, Senior Vice President of Networking at NVIDIA, added, “Context is the fuel of thinking. Just like humans that write things down to remember them, AI agents need to save their work so they can reuse what they’ve learned,” highlighting the platform’s role in enabling scalable, secure, and production-ready agentic inference.