
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When a model overfits, it does well on tasks it has already seen but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
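How a pile of head-to-head votes becomes a ranking is not spelled out here. As a rough illustration only, the sketch below uses an Elo-style rating update, a common way to rank competitors from pairwise results. The choice of Elo, the function name `elo_rank`, the starting rating, the `k` step size and the model names are all assumptions made for this example, not details of Chatbot Arena's actual method.

```python
from collections import defaultdict


def elo_rank(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes given as (winner, loser) tuples.

    Hypothetical helper: Elo is assumed here for illustration and is not
    confirmed as the leaderboard's real scoring method.
    """
    ratings = defaultdict(lambda: base)  # every model starts at the same rating
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected value) moves the ratings more than a predictable win.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest-rated model first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up votes between placeholder models.
    sample_votes = [
        ("model_a", "model_b"),
        ("model_a", "model_c"),
        ("model_b", "model_c"),
    ]
    for name, rating in elo_rank(sample_votes):
        print(f"{name}: {rating:.1f}")
```

Each vote shifts only the two models involved, so a scheme like this can absorb thousands of new votes a day without recomputing everything from scratch. The result does depend on the order in which votes arrive, which is one reason a real leaderboard might prefer a different statistical model.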

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as good as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
