Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on tasks they have already seen but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, which makes it easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
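To illustrate how head-to-head votes can be turned into a ranking, here is a minimal sketch using an Elo-style rating, a common scheme for pairwise comparisons. The article does not say which method the Arena uses, so the choice of Elo, the starting rating of 1000, the K-factor of 32 and the model names are all assumptions made for illustration.

```python
from collections import defaultdict


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one Elo update for a single vote in which `winner` beat `loser`."""
    r_w, r_l = ratings[winner], ratings[loser]
    # Probability the winner was expected to win, given the current ratings.
    expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
    # An upset (low expected_w) moves the ratings more than an expected win.
    delta = k * (1.0 - expected_w)
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


def rank_models(votes, initial=1000.0, k=32.0):
    """Process (winner, loser) votes in order and return models sorted by rating."""
    ratings = defaultdict(lambda: initial)  # every model starts at `initial`
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (model the voter preferred, the other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
leaderboard = rank_models(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote adds to the winner exactly what it takes from the loser, the total of all ratings stays constant, and a model climbs the board only by beating others more often than its current rating predicts.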

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
