AWS and Cerebras Partner on AI Inference Speed and Cloud Performance

Amazon Web Services (AWS), a subsidiary of Amazon, has announced a strategic partnership with Cerebras Systems that aims to deliver some of the fastest AI inference performance available for generative AI applications and large language model workloads.

The technology stack, which will be available on Amazon Bedrock within AWS data centers, combines AWS Trainium-powered servers, Cerebras CS-3 systems, and Elastic Fabric Adapter networking. Together, these components are expected to deliver a significant boost in AI inference performance for enterprises.

Later this year, AWS is expected to offer leading open-source large language models and Amazon Nova running on Cerebras hardware, giving customers the flexibility to scale generative AI applications.

“Inference is where AI delivers real value to customers, but speed remains a critical bottleneck for demanding workloads like real-time coding assistance and interactive applications,” said David Brown, Vice President, Compute & ML Services, AWS. “What we’re building with Cerebras solves that: by splitting the inference workload across Trainium and CS-3, and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it’s best at. The result will be inference that’s an order of magnitude faster and higher performance than what’s available today.”

“Partnering with AWS to build a disaggregated inference solution will bring the fastest inference to a global customer base,” said Andrew Feldman, Founder and CEO of Cerebras Systems. “Every enterprise around the world will be able to benefit from blisteringly fast inference within their existing AWS environment.”

Introducing Disaggregated AI Inference

The combined Trainium + CS-3 architecture introduces a technique known as inference disaggregation, which separates AI inference into two core stages: prompt processing (“prefill”) and output generation (“decode”).

These two processes have very different computational requirements:

Prefill: Highly parallel and compute-intensive, requiring moderate memory bandwidth.

Decode: Sequential and memory-bandwidth heavy, often accounting for the majority of inference time since tokens must be generated one after another (the rough arithmetic below shows why).
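
To make the decode bottleneck concrete, here is a back-of-envelope sketch in Python. The model size and bandwidth figures are illustrative assumptions, not numbers from the announcement: a hypothetical 70B-parameter model held in 16-bit precision, a GPU-class HBM figure of roughly 3.35 TB/s, and the roughly 21 PB/s on-wafer bandwidth Cerebras has publicly cited for its wafer-scale hardware.

```python
# Back-of-envelope only: in single-stream decode, each generated token must
# stream the full model weights through memory, so bandwidth caps tokens/s.

def decode_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode throughput for a memory-bound model."""
    return bandwidth_bytes_per_s / weight_bytes

# Hypothetical 70B-parameter model in 16-bit precision (~140 GB of weights).
weights = 70e9 * 2

gpu_bw = 3.35e12    # ~3.35 TB/s: assumed HBM bandwidth of a high-end GPU
wafer_bw = 21e15    # ~21 PB/s: on-wafer bandwidth figure Cerebras cites

print(f"GPU-class bound:   ~{decode_tokens_per_second(weights, gpu_bw):,.0f} tokens/s")
print(f"Wafer-scale bound: ~{decode_tokens_per_second(weights, wafer_bw):,.0f} tokens/s")
```

Under these assumptions, a single decode stream tops out at a few dozen tokens per second on GPU-class memory, which is why raw memory bandwidth, rather than raw compute, dominates generation speed.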

By assigning each stage to hardware optimized for those specific tasks, the new architecture improves performance and efficiency. Trainium is optimized for the prefill stage, while the Cerebras CS-3 system handles decode operations. High-bandwidth connectivity through Elastic Fabric Adapter (EFA) enables low-latency communication between the two components.
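
The division of labor described above can be pictured with a short, self-contained Python sketch. Everything here is hypothetical scaffolding rather than AWS or Cerebras code: a toy character-level "model" stands in for the LLM, the class names are invented for illustration, and the KV-cache handoff happens in-process instead of over a fast interconnect like EFA.

```python
# Toy illustration of inference disaggregation: prefill and decode are
# separate workers, connected only by the handed-off cache object.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for the attention key/value state produced by prefill."""
    tokens: list[int] = field(default_factory=list)

class PrefillWorker:
    """Compute-heavy stage: processes the whole prompt in parallel."""
    def prefill(self, prompt: str) -> KVCache:
        # In a real system this is one large, highly parallel forward pass.
        return KVCache(tokens=[ord(c) for c in prompt])

class DecodeWorker:
    """Bandwidth-heavy stage: generates output one token at a time."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> str:
        out = []
        for _ in range(max_new_tokens):
            # Toy "model": the next token cycles the last one through a-z.
            nxt = (cache.tokens[-1] - ord("a") + 1) % 26 + ord("a")
            cache.tokens.append(nxt)   # each step extends the KV cache
            out.append(chr(nxt))
        return "".join(out)

# Prefill runs on one device, the cache is handed off, decode runs on another.
prefiller, decoder = PrefillWorker(), DecodeWorker()
cache = prefiller.prefill("abc")                  # stage 1: prompt processing
print(decoder.decode(cache, max_new_tokens=5))    # stage 2: sequential generation -> "defgh"
```

The point of the sketch is the handoff: prefill builds the key/value cache in one parallel pass over the prompt, and decode then extends it one token at a time, which is what allows the two stages to run on different hardware tuned for each job.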

The platform is built on the AWS Nitro System, providing the same security, isolation, and operational reliability that customers expect from AWS cloud infrastructure.

Trainium and CS-3 Power Next-Generation AI Workloads

AWS Trainium is Amazon’s purpose-built AI chip designed to deliver scalable performance and cost efficiency for both model training and inference across generative AI workloads. Several major AI organizations, including Anthropic and OpenAI, have committed to using Trainium infrastructure. Anthropic has selected AWS as its primary training partner, while OpenAI is expected to utilize 2 gigawatts of Trainium capacity through AWS infrastructure to support advanced AI workloads and frontier models.

Meanwhile, the Cerebras CS-3 platform is designed to deliver extremely high inference performance, offering thousands of times more memory bandwidth than leading GPU systems. This performance advantage is particularly important as modern reasoning models generate more tokens per request while processing complex queries.

Organizations such as Cognition Labs and Mistral AI are already using Cerebras technology to accelerate computationally intensive AI workloads, particularly in areas such as agentic coding systems, where inference speed is critical for developer productivity.

This partnership between AWS and Cerebras will give organizations access to the high-performance AI inference infrastructure required for real-time generative AI applications, advanced reasoning systems, and next-generation AI-powered software experiences.