Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
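The article does not say how the Arena turns head-to-head votes into a ranking. As a rough illustration only, the sketch below assumes an Elo-style rating update, a common way to rank competitors from pairwise outcomes. The model names, the starting rating of 1000 and the step size `k` are all hypothetical, not details from the Arena itself.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) votes into ratings with an Elo-style update.

    Assumption: this is an illustrative scheme, not the Arena's confirmed method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected_win) moves the ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each pair is (model the visitor preferred, the other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because each vote moves the same number of points from the loser to the winner, the total across all models stays constant, and a model climbs fastest by beating higher-rated opponents.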

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as high in quality as those from hired human experts. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
