Are you still smarter than an AI? There’s a way to keep track

Visitors can see how each model performs on specific benchmarks, as well as what its average score is overall. No model has yet achieved a perfect score of 100 points on any benchmark. Smaug-72B, a new AI model created by the San Francisco-based startup Abacus.AI, recently became the first to break past an average score of 80.
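As a rough illustration of how an overall average relates to per-benchmark results, the sketch below takes the plain arithmetic mean of a model's scores. The model names, benchmark names and numbers are invented for the example and are not real leaderboard data; the leaderboard's exact averaging method is assumed here, not confirmed.

```python
from statistics import mean

# Hypothetical per-benchmark scores on a 0-100 scale (illustrative only).
scores = {
    "hypothetical_model_x": {"benchmark_1": 78.0, "benchmark_2": 85.5, "benchmark_3": 81.0},
    "hypothetical_model_y": {"benchmark_1": 70.0, "benchmark_2": 90.0, "benchmark_3": 74.0},
}


def average_score(benchmark_scores):
    """Return the unweighted mean of a model's benchmark scores."""
    return mean(benchmark_scores.values())


# Rank models from highest to lowest average.
ranking = sorted(scores, key=lambda name: average_score(scores[name]), reverse=True)

for name in ranking:
    print(f"{name}: {average_score(scores[name]):.1f}")
```

In this made-up example, the second model has the single highest benchmark score yet the lower average, which is why the per-benchmark view and the overall view can tell different stories.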

Many of the LLMs are already surpassing the human baseline level of performance on such tests, indicating what researchers call “saturation.” Thomas Wolf, a co-founder and the chief science officer of Hugging Face, said that usually happens when models improve their capabilities to the point where they outgrow specific benchmark tests — much like when a student moves from middle school to high school — or when models have memorized how to answer certain test questions, a concept called “overfitting.”

When that happens, models do well on previously performed tasks but struggle in new situations or on variations of the old task.

“Saturation does not mean that we are getting ‘better than humans’ overall,” Wolf wrote in an email. “It means that on specific benchmarks, models have now reached a point where the current benchmarks are not evaluating their capabilities correctly, so we need to design new ones.”

Some benchmarks have been around for years, and it becomes easy for developers of new LLMs to train their models on those test sets to guarantee high scores upon release. Chatbot Arena, a leaderboard founded by an intercollegiate open research group called the Large Model Systems Organization, aims to combat that by using human input to evaluate AI models.

Parli said that is also one way researchers hope to get creative in how they test language models: by judging them more holistically, rather than by looking at one metric at a time.

“Especially because we’re seeing more traditional benchmarks get saturated, bringing in human evaluation lets us get at certain aspects that computers and more code-based evaluations cannot,” she said.

Chatbot Arena allows visitors to ask any question they want to two anonymous AI models and then vote on which chatbot gives the better response.

Its leaderboard ranks around 60 models based on more than 300,000 human votes so far. Traffic to the site has increased so much since the rankings launched less than a year ago that the Arena is now getting thousands of votes per day, according to its creators, and the platform is receiving so many requests to add new models that it cannot accommodate them all.
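The article does not describe how head-to-head votes become a ranking. One common way to do it is an Elo-style rating, borrowed from chess, in which each vote nudges the winner's rating up and the loser's down. The sketch below is a minimal version under that assumption; the starting rating, the `k` step size, the model names and the votes are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings, winner, loser, k=32.0, start=1000.0):
    """Apply one vote: move the winner up and the loser down by the same amount."""
    r_win = ratings.get(winner, start)
    r_lose = ratings.get(loser, start)
    # An upset (a low-rated model beating a high-rated one) moves ratings more.
    delta = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta
    return ratings


# Hypothetical votes, each recorded as (winner, loser).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = {}
for winner, loser in votes:
    update_ratings(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.0f}")
```

Because each update depends on the ratings at the time of the vote, the result of this simple scheme can vary with the order in which votes arrive. That is one reason a real leaderboard might prefer a different statistical model.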

Chatbot Arena co-creator Wei-Lin Chiang, a doctoral student in computer science at the University of California, Berkeley, said the team conducted studies showing that crowdsourced votes produced results nearly as high in quality as those from hired human experts testing the chatbots. There will inevitably be outliers, he said, but the team is working on algorithms to detect malicious behavior from anonymous voters.

As useful as benchmarks are, researchers also acknowledge they are not all-encompassing. Even if a model scores well on reasoning benchmarks, it may still underperform when it comes to specific use cases like analyzing legal documents, wrote Wolf, the Hugging Face co-founder.

That’s why some hobbyists like to conduct “vibe checks” on AI models by observing how they perform in different contexts, he added, thus evaluating how successfully those models manage to engage with users, retain good memory and maintain consistent personalities.

Despite the imperfections of benchmarking, researchers say the tests and leaderboards still encourage innovation among AI developers who must constantly raise the bar to keep up with the latest evaluations.
