Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said using human input is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
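How head-to-head votes can become a ranking is easier to see with a small example. The article does not say how Chatbot Arena aggregates its votes, so the sketch below is only an illustration that assumes an Elo-style rating update, a common approach for pairwise comparisons. The function names, the starting rating of 1000 and the step size `k` are all hypothetical choices, not details from the Arena.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn pairwise votes into ratings with an Elo-style update.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b" or "tie". The values of k and base are illustrative assumptions.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between three made-up models.
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-x", "model-z", "a"),
        ("model-y", "model-z", "a"),
    ]
    print(leaderboard(sample_votes))
```

In this scheme an upset win over a higher-rated model moves the ratings more than an expected win, which is one reason a large volume of votes helps the ranking settle.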

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
