Share this @internewscast.com

Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When a model overfits, it does well on tasks it has already seen but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, which makes it easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
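The article does not say how the Arena turns votes into rankings. As a rough illustration only, the sketch below uses an Elo-style rating update, a common way to rank competitors from head-to-head results. It is an assumption, not the Arena's confirmed method. The model names, starting rating and step size are all hypothetical.

```python
from collections import defaultdict

# Assumed constants for illustration; not taken from the article.
K = 32.0        # how far a single vote can move a rating
BASE = 1000.0   # rating every model starts with


def expected_score(r_a, r_b):
    """Probability that a model rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(votes, k=K, base=BASE):
    """Apply one Elo update per (winner, loser) vote, in order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # upsets move ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def leaderboard(votes):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    return sorted(rate(votes).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred model, other model).
    sample_votes = [
        ("model_a", "model_b"),
        ("model_a", "model_c"),
        ("model_b", "model_c"),
        ("model_c", "model_a"),
    ]
    for name, rating in leaderboard(sample_votes):
        print(f"{name}: {rating:.1f}")
```

Each vote shifts points from the losing model to the winning one, so a model that keeps winning side-by-side comparisons climbs the board. Because updates are applied in sequence, the order of votes affects the final numbers in this simple version.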

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
