
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically rather than looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
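The article does not say how the Arena turns individual votes into a ranking. As a rough illustration only, the sketch below assumes an Elo-style rating update, a common way to rank competitors from pairwise outcomes. The function names, the starting rating of 1000, the K-factor of 32 and the model names are all hypothetical and are not taken from Chatbot Arena.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b" or "tie". All values here are illustrative assumptions.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        # Move both ratings by the same amount in opposite directions.
        delta = k * (outcome[winner] - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(ratings):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes between three made-up models.
example_votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
example_board = leaderboard(elo_ratings(example_votes))
```

In this toy run, the model that wins both of its matchups ends up at the top of `example_board`. A real system would also need to handle vote quality, which is the problem the Arena team describes when it talks about detecting malicious voters.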

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
