Meta announced today a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.
The announcement, made at Meta’s inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly growing AI inference service market, where developers purchase tokens by the billions to power their applications.
“Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API,” said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. “We at Cerebras are really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers.”
The partnership marks Meta’s formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta’s Llama models have accumulated over one billion downloads, until now the company had not offered a first-party cloud infrastructure for developers to build applications with them.
“This is very exciting, even without talking about Cerebras specifically,” said James Wang, a senior executive at Cerebras. “OpenAI, Anthropic, Google: they built an entire new AI business from scratch, which is the AI inference business. Developers who are building AI apps will buy tokens by the billions sometimes. And these are just like the new compute instructions that people need to build AI applications.”
Breaking the Speed Barrier: How Cerebras Supercharges Llama Models
What sets Meta’s offering apart is the dramatic speed increase provided by Cerebras’ specialized AI chips. The Cerebras system delivers over 2,600 tokens per second for Llama 4 Scout, compared to approximately 130 tokens per second for ChatGPT and around 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.
“If you just compare on an API-to-API basis, Gemini and GPT, they’re all great models, but they all run at GPU speeds, which is roughly about 100 tokens per second,” Wang explained. “And 100 tokens per second is okay for chat, but it’s very slow for reasoning. It’s very slow for agents. And people are struggling with that today.”
This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, conversational low-latency voice systems, interactive code generation, and instant multi-step reasoning, all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.
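To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python using the throughput figures cited in this article. The step count and tokens-per-step are illustrative assumptions, not numbers from Meta or Cerebras:

```python
# Rough end-to-end latency for a multi-step agent that chains LLM calls.
# Throughput figures (130 and 2,600 tokens/sec) come from the benchmarks
# cited above; STEPS and TOKENS_PER_STEP are assumed for illustration.
STEPS = 10              # chained LLM calls in one agent run (assumed)
TOKENS_PER_STEP = 500   # output tokens generated per call (assumed)

for name, tokens_per_sec in [("GPU-based (~130 tok/s)", 130),
                             ("Cerebras (~2,600 tok/s)", 2600)]:
    total_seconds = STEPS * TOKENS_PER_STEP / tokens_per_sec
    print(f"{name}: {total_seconds:.1f} s end-to-end")

# Prints roughly 38.5 s for the GPU case versus 1.9 s on Cerebras;
# with longer chains or more tokens, the gap becomes minutes versus seconds.
```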
The Llama API represents a significant shift in Meta’s AI strategy, transitioning the company from primarily being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta is creating a revenue stream from its AI investments while maintaining its commitment to open models.
“Meta is now in the business of selling tokens, and it’s great for the American kind of AI ecosystem,” Wang noted during the press conference. “They bring a lot to the table.”
The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it won’t use customer data to train its own models, and models built using the Llama API can be transferred to other hosts, a clear difference from more closed approaches.
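Meta has not published the API surface described here, so the following Python sketch is purely hypothetical: the base URL, endpoint paths, and payload fields are invented for illustration and only mirror the generate-train-evaluate workflow the article describes.

```python
# Hypothetical sketch of the fine-tuning workflow described above.
# The base URL, endpoints, and fields below are NOT Meta's actual API;
# they are placeholders illustrating the generate/train/evaluate loop.
import requests

BASE = "https://api.example-llama-host.com/v1"  # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Upload training data (generated or hand-curated).
dataset = requests.post(
    f"{BASE}/datasets", headers=HEADERS,
    json={"examples": [{"prompt": "...", "completion": "..."}]},
).json()

# 2. Launch a fine-tuning job against the base model.
job = requests.post(
    f"{BASE}/fine-tunes", headers=HEADERS,
    json={"base_model": "llama-3.3-8b", "dataset": dataset["id"]},
).json()

# 3. Evaluate the resulting custom model's quality.
evaluation = requests.post(
    f"{BASE}/evaluations", headers=HEADERS,
    json={"model": job["id"]},
).json()
print(evaluation)
```

Because Meta says models built this way can be exported, the artifact produced in step 2 would, per the article, be portable to other hosts rather than locked to Meta’s infrastructure.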
Cerebras will power Meta’s new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California.
“All of our data centers that serve inference are in North America at this time,” Choi explained. “We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers.”
The business arrangement follows what Choi described as “the classic compute provider to a hyperscaler” model, similar to how Nvidia provides hardware to major cloud providers. “They are reserving blocks of our compute so that they can serve their developer population,” she said.
Beyond Cerebras, Meta has also announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives beyond traditional GPU-based inference.
Meta’s entry into the inference API market with superior performance metrics challenges established players like OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.
“Meta is in a unique position with 3 billion users, hyperscale datacenters, and a huge developer ecosystem,” according to Cerebras’ presentation materials. The integration of Cerebras technology “helps Meta leapfrog OpenAI and Google in performance by approximately 20x.”
For Cerebras, this partnership represents a major milestone and a validation of its specialized AI hardware approach. “We have been building this wafer-scale engine for years, and we always knew that the technology’s first rate, but ultimately it has to end up as part of someone else’s hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone,” Wang said.
The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in accessing the ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.
“If you imagine a developer who doesn’t know anything about Cerebras because we’re a relatively small company, they can just click two buttons on Meta’s standard SDK, generate an API key, select the Cerebras flag, and then all of a sudden, their tokens are being processed on a giant wafer-scale engine,” Wang explained. “That kind of having us be on the back end of Meta’s whole developer ecosystem is just tremendous for us.”
Meta’s choice of specialized silicon signals something profound: in the next phase of AI, it’s not just what your models know, but how quickly they can think. In that future, speed isn’t just a feature; it’s the whole point.