Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
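
For readers curious how head-to-head votes can become a ranking, here is a minimal sketch of an Elo-style rating, a common way to score pairwise comparisons. The article does not describe the Arena's actual scoring method, so the function names, the starting rating of 1000 and the update factor `k` are all illustrative assumptions.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b" or "tie". All values here are hypothetical choices,
    not the Arena's real parameters.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Move each rating toward the observed outcome; the two
        # adjustments are equal and opposite.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical votes between three made-up models.
sample_votes = [("x", "y", "a"), ("x", "z", "a"), ("y", "z", "a")]
ranking = leaderboard(sample_votes)
```

In this sketch an upset win against a higher-rated model moves the ratings more than an expected win, and the result depends on the order in which votes arrive.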

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
