Tech
Jon Keegan

If AI models can ace every test, it’s actually not a good thing

AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, companies point to scores on widely used standardized tests, known as AI benchmarks, such as HellaSwag or MMLU, to show off how well the tools can write code or solve problems.

But these benchmarks have some serious flaws. Many of the most-cited ones were created to test much simpler AI systems, years before chatbots like ChatGPT hit the scene. Many were built by scraping amateur websites and lacked expert oversight. And because they have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, with no standard way to compare the results between them.

More Tech

tech

Lyft and Uber jump after announcing expanded robotaxi partnerships with Nvidia

Uber and Lyft both announced expanded AI and autonomous vehicle partnerships with Nvidia at the company’s GTC event, sending both ride-hailing stocks up after-hours.

Uber was recently up 3.3%, while Lyft rose 3%.

Uber said Nvidia-powered Level 4 robotaxis will launch on its platform in Los Angeles and San Francisco in 2027, with plans to scale to 28 cities globally by 2028. Meanwhile, Lyft said it will use Nvidia’s AI infrastructure to improve ride-matching, mapping, and efficiency, while also using Nvidia’s DRIVE Hyperion platform as a foundation for future autonomous fleets.

Separately, Nvidia announced expanded autonomous driving partnerships with Kia and Hyundai.

The announcements highlight Nvidia’s growing push to provide the AI hardware and software powering next-generation robotaxi networks — packaging the technology needed for self-driving cars into a platform that other companies can use to compete with Tesla.

15

Tesla’s Robotaxi program has disclosed its 15th accident, Electrek reports, citing the latest filing from the National Highway Traffic Safety Administration. According to Electrek’s estimation, extrapolated from the last time Tesla disclosed mileage figures, that amounts to a crash every 57,000 miles — about 9x the rate for humans.

The latest crash involved a Model Y hitting a fixed object at 9 mph in January while the autonomous system was engaged.

Humans are very much still involved with Tesla’s so-called autonomous driving service. Despite the service announcing in January that it had started removing safety monitors from the front seats, only two unsupervised vehicles have been spotted in the last month, per Robotaxi Tracker. The entire fleet has also dwindled from around 50 vehicles to just 35. Their mileage is unavailable.

tech

Meta’s reported 20% layoff could bring headcount to its lowest level since 2021

Meta shares are rising Monday morning after Reuters reported the tech giant is planning to lay off 20% of its employees in an effort to use AI to make its workforce more efficient and offset its surging AI capex costs.

On the company’s last earnings call, CEO Mark Zuckerberg touted 30% efficiency gains for its software engineers and said some “power users” of the company’s AI coding tools saw productivity jump as high as 80% — what some saw as a veiled threat to employees who failed to use AI to boost their output.

Meta’s headcount was nearly 79,000 last quarter, having steadily risen since its layoffs during the self-described “year of efficiency” in 2023. A 20% cut would bring headcount to around 63,000 — the company’s lowest level since 2021.

Shares were recently up 2.7%.

tech

Report: Amid safety failures, ChatGPT’s planned “adult mode” caused concern within OpenAI, with minors misclassified as adults 12% of the time

Despite a series of alarming mental health safety failures that resulted in ChatGPT users allegedly using the product to plan suicides and murder, OpenAI decided to double down on its plan to roll out an “adult mode,” allowing the AI chatbot to produce erotic content.

That decision raised alarms within the company, including warnings that users could develop unhealthy emotional dependence on the chatbot and that the new age estimation feature was imperfect, and therefore likely to allow minors to access the feature, according to a new report from The Wall Street Journal. Per the report, some 12% of the time, the age estimation feature mistakenly classified minors as adults.

OpenAI’s council of mental health experts was “furious” and unanimous in its opposition to moving forward with the adult mode feature after being told about the decision in January, raising concerns about creating a “sexy suicide coach.”

Earlier this month, the company said it would delay the new feature to focus on other products.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Derivatives, LLC, or Robinhood Money, LLC. Futures and event contracts are offered through Robinhood Derivatives, LLC.