
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many LLMs already surpass the human baseline on such tests, a sign of what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said saturation usually happens in one of two ways: when models improve to the point where they outgrow specific benchmark tests, much like a student moving from middle school to high school, or when models have memorized how to answer certain test questions, a concept called “overfitting.”

An overfitted model does well on tasks it has already seen but struggles in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said human evaluation is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
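To make the idea of ranking models from head-to-head votes concrete, here is a minimal sketch in Python. It uses an Elo-style rating update, which is one common way to turn pairwise comparisons into a ranking. The article does not say which method Chatbot Arena uses, so the scoring rule, the function name `elo_ratings`, the parameters `k` and `base`, and the model names below are all illustrative assumptions, not the Arena's actual implementation.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) votes into Elo-style ratings.

    Hypothetical helper for illustration only; `k` (update size) and
    `base` (starting rating) are assumed values.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the Elo model assigned to the winner before the vote.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An unexpected win moves the ratings more than an expected one.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Made-up votes: each tuple means the first model was preferred over the second.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

In this toy run the model that wins both of its matchups ends up on top. A real leaderboard would also need to handle ties, uneven numbers of matchups per model, and the suspicious voting patterns that the Arena team says it is working to detect.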

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as good as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers, who must constantly raise the bar to keep up with the latest evaluations.
