Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
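One common way to turn head-to-head votes like these into a ranking is an Elo-style rating, in which each vote moves the winner's score up and the loser's down. The sketch below is illustrative only: the article does not say which scoring method Chatbot Arena uses, and the model names, the starting rating of 1,000 and the update factor `k` are all hypothetical.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b" or "tie". The k and base values are assumptions chosen
    for illustration, not figures from the article.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # Move each rating toward the observed result; the two
        # adjustments are equal and opposite.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb - k * (score_a - expected_a)
    return dict(ratings)


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical votes: each tuple is one visitor's comparison.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
]
ranking = leaderboard(votes)
```

With these three made-up votes, `ranking` places `model-x` first, since it won both of its matchups. A real leaderboard would draw on far more votes per model and would need to account for ties, vote order and bad-faith voters.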

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
