
Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When models overfit in this way, they do well on tasks they have already seen but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time. 

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
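The article does not say how the Arena turns votes into a ranking. As a rough illustration only, one common way to rank competitors from head-to-head results is an Elo-style rating update. The sketch below assumes that approach; the function name `elo_rank`, the model names, the starting rating of 1000 and the step size `k` are all hypothetical and are not taken from Chatbot Arena.

```python
from collections import defaultdict


def elo_rank(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes with an Elo-style update.

    `votes` is an iterable of (winner, loser) name pairs. This is an
    assumed, illustrative scheme, not Chatbot Arena's actual method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the current ratings assigned to the winner winning.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An unexpected win moves the ratings more than an expected one.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple records one visitor's preference.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
leaderboard = elo_rank(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

With only three votes the ratings are noisy. A scheme like this becomes meaningful only with a large volume of votes, and because the result depends on the order in which votes arrive, it is a sketch rather than a production ranking method.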

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team's studies showed that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
