Meta Cheated on AI Benchmarks and It’s a Glimpse Into a New Golden Age

Meta cheated on an AI benchmark, and that is hilarious. According to Kylie Robison at The Verge, the suspicions started percolating after Meta released two new AI models based on its Llama 4 large language model over the weekend. The new models are Scout, a smaller model intended for quick queries, and Maverick, which is meant to be a super-efficient rival to better-known models like OpenAI’s GPT-4o (the harbinger of our Miyazaki apocalypse).

In the blog post announcing them, Meta did what every AI company now does with a major release: it dropped a whole bunch of highly technical data to brag about how its AI was smarter and more efficient than models from companies better associated with AI. These release posts are always mired in deeply technical data and benchmarks that are hugely beneficial to researchers and the most AI-obsessed, but kind of useless for the rest of us. Meta’s announcement was no different.

But plenty of AI obsessives immediately noticed one shocking benchmark result Meta highlighted in its post: Maverick had an Elo score of 1417 on LMArena. LMArena is an open-source collaborative benchmarking tool where users vote on the best output. A higher score is better, and Maverick’s 1417 put it in the number two spot on LMArena’s leaderboard, just above GPT-4o and just below Gemini 2.5 Pro. The whole AI ecosystem rumbled with surprise at the results.
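For a sense of where a number like 1417 comes from, here is a minimal sketch of a head-to-head Elo update, the rating scheme that leaderboards like LMArena’s are based on. The K-factor and the example ratings below are illustrative assumptions, not LMArena’s actual parameters, and the site’s real pipeline is more involved than a single update rule.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Adjust both ratings after one human vote; k controls how fast ratings move."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: a 1417-rated model wins a single user vote
# against a 1410-rated rival, and both ratings shift slightly.
print(update_elo(1417, 1410, a_won=True))
```

The point of the sketch is just that every rating is built out of thousands of individual human votes, which is exactly why a model tuned to be more likable to voters can climb the board.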

Then they started digging, and quickly noted that, in the fine print, Meta had acknowledged that the Maverick model crushing it on LMArena was a tad different than the version users have access to. The company had programmed this model to be more chatty than usual. Effectively, it charmed the benchmark into submission.

It does not seem like LMArena was pleased with the charm offensive. “Meta’s interpretation of our policy did not match what we expect from model providers,” it said in a statement on X. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. We are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion does not occur in the future.”

I love LMArena’s optimism here, because gaming a benchmark feels like a rite of passage in consumer technology, and I suspect this trend will continue. I’ve been covering consumer technology for over a decade, I once ran one of the more extensive benchmarking labs in the industry, and I have seen plenty of phone and laptop makers attempt all kinds of tricks to juice their scores. They messed with display brightness for better battery life results and shipped bloatware-free versions of laptops to reviewers to get better performance scores.

Now AI models are getting more chatty to juice their scores too. And the reason I suspect this won’t be the last carefully cultivated score is that right now these companies are desperate to distinguish their large language models from one another. If every model can help you write a shitty English paper five minutes before class, then you’ll need another reason to distinguish your preference. “My model uses less energy and accomplishes the task 2.46% faster” might not seem like the biggest brag, but it matters. That’s still 2.46% faster than everything else.

As these AIs continue to mature into actual consumer-facing products, we’ll start seeing more benchmark bragging. Hopefully, we’ll see the other stuff too. User interfaces will start to change, and goofy stores like the Explore GPTs section of the ChatGPT app will become more common. These companies are going to need to prove why their models are the best, and benchmarks alone won’t do that. Not when a chatty bot can game the system so easily.
