Rani Molla

Meta’s not telling where it got its AI training data

Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone product. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on “data of the highest quality”! A dataset seven times larger than Llama 2’s! And one that includes four times more code!

What is that training data? There the company is less loquacious.

Meta said the 15 trillion tokens on which Llama 3 was trained came from “publicly available sources.” Which sources? Meta told The Verge’s Alex Heath that the training data didn’t include Meta user data, but didn’t give much more in the way of specifics.

It did mention that it includes AI-generated data, or synthetic data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it’s liable to spit out a more concentrated version of any garbage it is ingesting.

AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)

Meta, of course, isn’t the only company that’s tight-lipped about where its AI data is coming from. In a now-infamous interview with WSJ’s Joanna Stern, OpenAI’s chief technology officer Mira Murati was unable to answer questions about what Sora, OpenAI’s video-generating app, was trained on. YouTube? Facebook? Instagram? She said she wasn’t sure.

More Tech

Google rolls out Private AI Compute matching Apple’s AI privacy scheme

One of the barriers to people embracing AI in their daily lives is trust — making sure that the company that built the AI isn’t going to just spill your most sensitive info to advertisers and data brokers.

Google is announcing a new feature called Private AI Compute that takes a page from Apple to help assure users that Google will keep their AI data private.

In June 2024, Apple announced its Private Cloud Compute scheme, which ensures only the user can access data sent to the cloud to enable AI features.

While Apple’s AI tools have yet to fully materialize, Google’s new offering looks a lot like Apple’s. AI models on Google’s phones process data in a secure environment, and when more computing power is needed, that security extends to the cloud, where the data is processed by Google’s custom TPU chips.

A press release said: “This ensures sensitive data processed by Private AI Compute remains accessible only to you and no one else, not even Google.”

315M

Amazon says it has 315 million monthly active viewers for its Prime Video ads, according to Deadline, up from 200 million in April 2024. The number comes just a week after Netflix said it had 190 million monthly active viewers.

The self-reported numbers have different methodologies. Netflix counts the number of ad-tier subscribers who’ve watched at least one minute of ads per month and multiplies that by its estimated household size. Amazon’s number represents an unduplicated average monthly active ad-supported audience across its programming from September 2024 through August 2025.

The services themselves also aren’t exactly comparable. Netflix charges $7.99 a month for its ad-supported tier, while Prime Video comes bundled as part of Amazon Prime — and now automatically comes with ads unless consumers pay an extra $2.99 per month to remove them.

1.6M

Chinese EV maker and Tesla competitor BYD could sell up to 1.6 million vehicles abroad next year, according to a new Citi report cited by Reuters. That’s potentially 60% more than the roughly 1 million vehicles BYD is expected to sell outside China this year. The 1.6 million figure also matches the number of vehicles that analysts polled by FactSet expect Tesla to sell in total in 2025.

Apple reportedly considers adding a second camera to iPhone Air and pushing next release to 2027

Apple is delaying its next iPhone Air to the spring of 2027, from the fall of 2026, as it potentially rejiggers the model to include a second camera lens, according to The Information. Consumers have largely overlooked Apple’s latest, thinnest phone, choosing instead to buy the standard and Pro models, thanks in part to the Air’s single camera and relatively weak battery life. The preference caused Apple to greatly scale back production for its Air model.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.