Tech
Robot Among City Ruins (CSA Images)

How well can top AI models do these jobs?

An OpenAI benchmark tests how well AI models can perform “economically valuable” jobs.

One of the biggest fears fueling the public’s apprehension toward AI is that the technology will eventually take people’s jobs.

We’ve already seen evidence that some roles like entry-level software development, customer service, and marketing are feeling the effects of automation powered by generative AI. Being able to track the real-world work capabilities of AI models will become increasingly important as models get more and more powerful.

To that end, OpenAI has created a new AI benchmark called “GDPval” that aims to measure just how well leading AI models can do realistic tasks for a variety of “economically valuable” jobs.

OpenAI describes the benchmark as an evolutionary step away from the first wave of benchmarks that followed a more academic, exam-style model:

“[GDPval] measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture on how models perform on economically valuable tasks. Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.”

Working with experienced industry professionals, the researchers created a dataset of 220 realistic tasks across 44 occupations, reflecting the kind of work someone in each role might actually do on the job.

Here’s an example of one of the tasks in the benchmark’s training data for a real estate broker:

Sample task for a real estate broker from the GDPval benchmark’s training dataset (Huggingface.co)
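For readers who want to poke at the underlying data, here’s a minimal sketch of how you might load and inspect the publicly released GDPval tasks from Hugging Face. The dataset identifier and column names below are assumptions for illustration, not confirmed details from OpenAI, so check the actual dataset card before relying on them.

```python
# Minimal sketch: peek at the GDPval tasks hosted on Hugging Face.
# NOTE: the dataset ID ("openai/gdpval"), the split name, and the column names
# ("occupation", "prompt") are assumptions for illustration only.
from datasets import load_dataset

ds = load_dataset("openai/gdpval", split="train")  # hypothetical dataset ID / split
occupations = set(ds["occupation"])                # assumed column name
print(f"{len(ds)} tasks across {len(occupations)} occupations")

# Look at one example task, e.g., for a real estate broker
for row in ds:
    if "real estate" in str(row.get("occupation", "")).lower():
        print(row.get("prompt", row))              # assumed column name
        break
```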

We went through the data and picked a few common jobs from the benchmark’s results. Unsurprisingly, software development was the most impacted job, with Anthropic’s Claude model earning an average 70% win rate, meaning graders rated its work as good as or better than a human expert’s in that role roughly 70% of the time (a score of 50% would put a model on par with a human expert). Audio and video technicians should feel that their jobs are secure (for now), as the models scored very low on those tasks.
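To make that scoring a little more concrete, here’s a toy sketch of how a win rate like that could be computed from blind, head-to-head comparisons against human experts’ deliverables. This is not OpenAI’s actual grading code, and the tie handling here is our own assumption.

```python
# Toy illustration of a pairwise win rate (not OpenAI's actual grading pipeline).
# Each task's model deliverable is compared blind against a human expert's work;
# graders pick a winner or call it a tie. Counting ties as half a win is an
# assumption made for this sketch, not something stated in the article.
def win_rate(judgments: list[str]) -> float:
    """judgments: one of 'model', 'human', or 'tie' per task."""
    score = sum(1.0 if j == "model" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return score / len(judgments)

# A result of 0.5 means the model's work is preferred about as often as the expert's.
print(win_rate(["model", "tie", "human", "model"]))  # 0.625
```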

OpenAI acknowledges there are limitations with this benchmark. For instance, each task currently comes with background materials required to do the work, but producing those materials is itself complex, and the benchmark doesn’t assess current models’ ability to complete that preparatory step; instead, it’s handled by the humans running the test. The paper also notes that this is a small dataset and that the jobs tested so far are mainly those of “knowledge workers” whose tasks can be performed on a computer.

Maybe a future version will be used to test how well a robot can scrub your toilet.

More Tech


The most outlandish tech CEO quotes from 2025

Tech CEOs have been nuttier than ever.

tech

Trump AI executive order is a “major win” for OpenAI, Google, Microsoft, and Meta, says Ives

President Trump’s new executive order aiming to keep states from enacting AI laws that inhibit US “global AI dominance” is a “major win” for OpenAI, Google, Microsoft, and Meta, according to Wedbush Securities analyst Dan Ives. Big Tech companies have collectively plowed hundreds of billions into the technology, while seeing massive stock price gains, and Ives believes they stand to gain much more.

“Given that there have been over 1,000 AI laws proposed at the state level, this was a necessary move by the Trump Administration to keep the US out in front for the AI Revolution over China,” Ives wrote, adding that state-by-state regulation “would have crushed US AI startup culture.” The presidential order would withhold federal funds from states that put in place onerous AI regulations.

This morning, White House AI adviser Sriram Krishnan said in a CNBC interview that he’d be working with Congress on a single national framework for AI.

Despite Ives’ rosy read-through on the order, many AI stocks are in the red early today, with the exception of Nvidia, which jumped on a report of boosted Chinese demand. The VanEck Semiconductor ETF is down nearly 1% premarket as the AI trade struggles following underwhelming earnings results from Oracle earlier this week.

tech
Rani Molla

Epic scores two victories as “Fortnite” returns to Google Play and appeals court keeps injunction against Apple

“Fortnite” maker Epic Games notched two wins Thursday in its drawn-out battle against Big Tech’s app stores. “Fortnite” returned to the Google Play app store in the US, Reuters reports, as Epic continues working with Google to secure court approval for their settlement.

Meanwhile, a US appeals court partly reversed sanctions against Apple in Epic’s antitrust case, calling parts of the order overly broad, but upheld the contempt finding and left a sweeping injunction in place — keeping pressure on Apple to allow developers to steer users to outside payment options and reduce its tight control over how apps can communicate and monetize on iOS.

tech
Jon Keegan

Report: AI-powered toys tell kids where to find matches, parrot Chinese government propaganda

You may want to think twice before buying your kids a fancy AI-powered plush toy.

A new report from NBC News found that several AI-powered kids’ toys could easily be steered toward dangerous and sexually explicit conversations, a shocking demonstration of the loose safety guardrails in this novel category of consumer electronics.

A report by the Public Interest Research Group details what researchers found when they tested five AI-powered kids’ toys bought from Amazon. Some of the toys offered instructions on where to find matches and how to start fires.

NBC News also bought some of these toys and found they parroted Chinese government propaganda and gave instructions for how to sharpen knives. Some of the toys also discussed inappropriate topics for kids, like sexual kinks.

The category of AI-powered kids toys is under scrutiny as major AI companies like OpenAI have announced partnerships with toy manufacturers like Mattel (which has yet to release an AI-powered toy).

tech
Jon Keegan

OpenAI releases GPT-5.2, the “best model yet for real-world, professional use”

After feeling the heat from Google’s recent launch of its powerful Gemini 3 model, OpenAI has released its response to its “code red,” reportedly on an accelerated schedule to keep up with the competition.

The company’s new flagship model, GPT-5.2, is out, and the company is calling it “the most capable model series yet for professional knowledge work.”

OpenAI CEO Sam Altman called it the “smartest generally-available model in the world” and shared benchmarks that showed it achieving higher scores than Gemini 3 Pro and Anthropic’s Claude Opus 4.5 on some software engineering tests and on abstract reasoning, math, and science problems.

In a press release announcing the new model, the company said: “Overall, GPT‑5.2 brings significant improvements in general intelligence, long-context understanding, agentic tool-calling, and vision — making it better at executing complex, real-world tasks end-to-end than any previous model.”
