Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.
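As a rough illustration of how an overall average relates to per-benchmark scores, the sketch below takes the unweighted mean of a set of scores on a 0 to 100 scale. The benchmark names and numbers are hypothetical placeholders, not real leaderboard results, and the leaderboard's actual aggregation may differ.

```python
def average_score(benchmark_scores):
    """Return the unweighted mean of per-benchmark scores (0-100 scale)."""
    if not benchmark_scores:
        raise ValueError("at least one benchmark score is required")
    return sum(benchmark_scores.values()) / len(benchmark_scores)


# Hypothetical scores for one model; names and values are made up for illustration.
scores = {
    "benchmark_a": 85.0,
    "benchmark_b": 78.0,
    "benchmark_c": 80.5,
    "benchmark_d": 77.0,
}

overall = average_score(scores)
# A model can clear an average of 80 while still falling short of 100 on every benchmark.
clears_80 = overall > 80
has_perfect_score = any(score >= 100 for score in scores.values())
```

In this made-up case the model's average is just above 80 even though two of its individual scores fall below that mark.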

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
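The article does not say how those head-to-head votes are turned into a ranking. One common way to rank competitors from pairwise outcomes is an Elo-style rating, and the sketch below uses it purely as an assumption. The function names, the starting rating of 1000, the update factor of 32 and the model names are all illustrative choices, not details from the Arena.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from (winner, loser) votes, applied in order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected probability) moves the ratings more.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ranking = leaderboard(votes)
```

Because each vote adds to one rating exactly what it subtracts from the other, the total rating pool stays constant. Sequential updates like these depend on vote order, which is one reason a real leaderboard might prefer a different statistical model.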

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
