Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.
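To make the arithmetic concrete, here is a minimal sketch of how an overall leaderboard score could be derived from per-benchmark results. It assumes the overall figure is a plain unweighted mean. The benchmark names and numbers are hypothetical placeholders, not real results for any model.

```python
from statistics import mean

def average_score(benchmark_scores):
    """Return the unweighted mean of a model's per-benchmark scores (0-100 scale)."""
    return mean(benchmark_scores.values())

# Hypothetical scores for an imaginary model; not taken from any leaderboard.
example_scores = {
    "benchmark_1": 76.0,
    "benchmark_2": 89.0,
    "benchmark_3": 77.5,
    "benchmark_4": 85.5,
}

overall = average_score(example_scores)
breaks_80 = overall > 80  # the threshold the article describes Smaug-72B passing
```

A model can clear an average of 80 while still scoring below 80 on individual benchmarks, as this made-up example does.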

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
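One common way to turn head-to-head votes like these into a ranking is an Elo-style rating, the system used to rank chess players. The sketch below illustrates that general technique only. The article does not say which method the Arena uses, and the function name, starting rating, update factor and model names here are all assumptions made for illustration.

```python
from collections import defaultdict

def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from (winner, loser) vote pairs.

    k and base are illustrative defaults, not values from any real leaderboard.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected probability) moves the ratings more.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical votes: each pair is (model the voter preferred, the other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Because each vote adds to one model exactly what it subtracts from the other, the total number of rating points stays constant. A ranking built this way depends on the order in which votes arrive, which is one reason a real system might prefer a different statistical model.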

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
