
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
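To make the idea of ranking models from head-to-head votes concrete, here is a minimal sketch of one common approach, an Elo-style rating update. This is an assumption for illustration only: the article does not say how Chatbot Arena converts votes into rankings, and the model names, the starting rating of 1000 and the update factor `k` are all hypothetical.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". All names and constants are illustrative.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # The winner gains what the loser gives up, so the total is conserved.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(votes):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes: each tuple is one visitor's judgment of two anonymous models.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
    ("model-z", "model-x", "tie"),
]

for name, rating in leaderboard(votes):
    print(f"{name}: {rating:.1f}")
```

In a scheme like this, an upset win over a higher-rated model moves the ratings more than an expected win, which is one reason a large volume of votes helps the ranking settle.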

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as reliable as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
