Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
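To make the idea of ranking by head-to-head votes concrete, here is a minimal sketch of one common way to turn pairwise votes into a leaderboard: an Elo-style rating update, borrowed from chess. The article does not say which method Chatbot Arena actually uses, so the technique, the model names, the starting rating of 1000 and the step size `k` are all illustrative assumptions.

```python
from collections import defaultdict


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise vote to the ratings table in place."""
    r_w, r_l = ratings[winner], ratings[loser]
    # Probability the eventual winner was expected to win, given current ratings.
    expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
    # An upset (low expected_w) moves ratings more than an expected result.
    delta = k * (1.0 - expected_w)
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


def rank_models(votes, start=1000.0, k=32.0):
    """Return (model, rating) pairs, best first, from (winner, loser) votes."""
    ratings = defaultdict(lambda: start)  # every model starts at the same rating
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
leaderboard = rank_models(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote adds to one model exactly what it takes from the other, the ratings only capture relative standing. A sequential update like this also depends on the order in which votes arrive, which is one reason a real leaderboard might prefer a different statistical model.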

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California-Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
