OpenAI models trained on over a million hours of YouTube videos: Report

According to a recent report by The New York Times, some of the biggest tech giants have used YouTube video transcripts to train their powerful AI language models, potentially violating the copyrights of the videos' creators.

The story claims that OpenAI used its speech recognition tool Whisper to transcribe over a million hours of YouTube content. These transcripts were then fed as training data into GPT-4, the AI model that powers ChatGPT Plus.

OpenAI is not the only company accused of mining YouTube data. The report claims that teams at Google did the same thing, using YouTube videos to build datasets for the company's own large language models, such as Bard/Gemini. A Google spokesperson told the publication that “unauthorized scraping or downloading of YouTube content” violates the company's policies.

However, the report suggests that Google may have turned a blind eye to OpenAI's alleged harvesting of YouTube transcripts. Google allegedly knew what OpenAI was doing but did not object because it was also using YouTube data to train its own AI.

Both companies had reportedly reached the limit of the useful training data available from more conventional sources such as books, websites and databases. OpenAI, for example, had reportedly already exhausted its useful sources in 2021. As a result, the companies began to explore new data sources such as videos and podcasts.

Google also reportedly changed its data policies in July last year to expand what it can do with consumer data, including data from tools like Google Docs.

OpenAI and Google have defended their practices, claiming that they only use public data or content when they have permission to do so. However, the allegations raise some thorny questions about fair use, copyright and privacy.

After all, most YouTube creators probably did not expect that their videos could be transcribed without their knowledge. The allegations suggest that, in the race for AI supremacy, big tech companies are comfortable cutting corners to satisfy the immense appetites of large language models.

© IE Online Media Services Pvt Ltd

First uploaded on: 04/09/2024 at 12:17 IST