Visitors can see how each model performs on specific benchmarks, as well as its overall average score. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When a model overfits, it does well on tasks it has already seen but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
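The article does not say how the Arena turns head-to-head votes into a ranking. As an illustration only, one common way to rank competitors from pairwise outcomes is an Elo-style rating, sketched below. The function name `elo_ratings`, the starting rating of 1000, the `k` step size, and the sample votes are all assumptions made for this example, not details from the Arena itself.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from a sequence of (winner, loser) votes.

    Hypothetical helper: `k` and `base` are illustrative defaults.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected_win) moves the ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Made-up votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because each vote adds to one rating exactly what it subtracts from the other, the total rating across models stays constant, and a model can only climb by winning comparisons that voters actually cast.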

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on creating algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers, who must constantly raise the bar to keep up with the latest evaluations.
