
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that approach is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
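For readers curious how head-to-head votes can become a ranking, here is a minimal sketch in Python. It assumes an Elo-style rating update, a common technique for pairwise comparisons; the article does not say which method Chatbot Arena actually uses, and the function names, model names, starting rating of 1,000 and step size `k` are all illustrative assumptions.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Fold pairwise votes into per-model ratings (Elo-style sketch).

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". All names and constants are hypothetical.
    """
    ratings = defaultdict(lambda: base)  # every model starts at `base`
    for a, b, winner in votes:
        ra, rb = ratings[a], ratings[b]
        # Probability that model a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Zero-sum update: whatever a gains, b loses.
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


def leaderboard(votes):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-y", "model-z", "a"),
        ("model-x", "model-z", "a"),
    ]
    for name, rating in leaderboard(sample_votes):
        print(f"{name}: {rating:.1f}")
```

In this sketch an upset win against a higher-rated model moves the ratings more than an expected win does, which is one reason a scheme like this can rank many models from votes that each compare only two of them.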

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
