Share this @internewscast.com

Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
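To illustrate how head-to-head votes can become a ranking, here is a minimal sketch using an Elo-style rating update, a common technique for pairwise comparisons. The article does not say which method the Arena uses, so the choice of Elo is an assumption, as are the function names, the starting rating of 1000, the K-factor of 32, and the placeholder model names.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn pairwise votes into ratings with an Elo-style update.

    votes: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". The k and base values are illustrative
    assumptions, not figures from the article.
    """
    ratings = defaultdict(lambda: base)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in votes:
        ra, rb = ratings[model_a], ratings[model_b]
        # Probability that model_a wins, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = outcome[winner]
        # Move each rating toward the observed result; the two
        # adjustments are equal and opposite.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


def leaderboard(ratings):
    """Return (model, rating) pairs sorted from highest to lowest rating."""
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between placeholder models.
    sample_votes = [
        ("model_x", "model_y", "a"),
        ("model_y", "model_z", "a"),
        ("model_x", "model_z", "a"),
    ]
    for name, rating in leaderboard(elo_ratings(sample_votes)):
        print(f"{name}: {rating:.1f}")
```

In this sketch, a surprising result (a low-rated model beating a high-rated one) shifts the ratings more than an expected one, which is why a large volume of votes matters: individual outlier votes get averaged out over time.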

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
