Visitors can see how each model performs on specific benchmarks, as well as its overall average score. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many LLMs already surpass the human baseline on such tests, a sign of what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve to the point where they outgrow specific benchmark tests, much like a student moving from middle school to high school, or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When a model overfits, it does well on tasks it has seen before but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
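To make the idea of ranking models from head-to-head votes concrete, here is a minimal sketch that uses an Elo-style rating update, a common way to turn pairwise comparisons into a ranking. It is an illustration only: the article does not say which method Chatbot Arena uses, and the function names, the starting rating of 1000 and the update factor `k=32` are all assumptions made for this example.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". The names, k and base are illustrative
    assumptions, not values taken from any real leaderboard.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        # Move both ratings by how surprising the vote was.
        delta = k * (outcome[winner] - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(votes):
    """Return model names sorted from highest to lowest rating."""
    ratings = elo_ratings(votes)
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical votes among three made-up models.
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-x", "model-z", "a"),
        ("model-y", "model-z", "a"),
    ]
    for name in leaderboard(sample_votes):
        print(name)
```

In this scheme, each vote shifts points from the loser to the winner, and an upset against a higher-rated model moves the ratings more than an expected result does. One known limitation of a sequential update like this is that the final ratings depend on the order in which votes arrive.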

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
