
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens in one of two ways: either models improve to the point where they outgrow specific benchmark tests, much like a student moving from middle school to high school, or models have memorized how to answer certain test questions, a concept called “overfitting.”

When a model overfits, it does well on tasks it has already seen but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
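The article does not say how head-to-head votes become a ranking. One common way to rank competitors from pairwise results is an Elo-style rating, and the sketch below assumes that approach purely for illustration; it is not a description of Chatbot Arena's actual method. The model names, the starting rating of 1000 and the update factor `k` are all hypothetical.

```python
from collections import defaultdict

def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn pairwise votes into Elo-style ratings (illustrative assumption).

    Each vote is (model_a, model_b, winner), where winner is "a", "b" or "tie".
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # The winner gains exactly what the loser gives up.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)

# Hypothetical votes, not real Chatbot Arena data.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "a"),
    ("model-x", "model-z", "a"),
    ("model-x", "model-y", "tie"),
]
leaderboard = sorted(elo_ratings(votes).items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each vote only shifts points between the two models involved, an upset against a highly rated model moves the ratings more than an expected win does.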

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as reliable as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
