The Big Y #205

Reaching Peak Data

Apr 08, 2024

Hi! 👋 Welcome to The Big Y!

Two things people can’t stop talking or complaining about: not enough compute, and not enough human-generated data left for LLMs to consume in the training phase.

When I was in school, the concept of peak oil often came up in discussions about our world’s energy future. From Wikipedia, “Peak oil is the theorized point in time when the maximum rate of global oil production will occur, after which oil production will begin an irreversible decline.” An early forecast, M. King Hubbert’s in 1956, put the peak of US production between 1965 and 1971, but predicted peak dates continue to shift with technological advances such as fracking and energy diversification such as solar and wind.

So, what does this have to do with data for generative AI? In the same way that there was a prediction of peak oil, we could predict a time of peak human data: the point in time when most human-created data has already been used to train generative AI models and only scraps of genuine human knowledge are left to be scraped.

We can extend the analogy that “data is the new oil” further. Technological innovation lets us gather more data: with improved voice-to-text and video-to-text, for example, we can scrape all the videos on the internet to produce new data (as OpenAI reportedly did with YouTube), much like fracking for new oil deposits (and probably not a sustainable choice). Alternatively, technology like synthetic data generation could play the role for peak data that renewable energy sources such as solar and wind farms play for peak oil.

Allow me to push the analogy a little further. Copyrighted data getting used irresponsibly is like pollution being released: it affects many people and creates long-term externalities that don’t have any easy solution. Training a model on copyrighted, protected, or even private data creates liabilities that may be very difficult to resolve.

While we are concerned about running out of human-generated data, there are still many technological advances to be made in data, compute, model sizes, and more to get us cooler and better new models.

The Tidbit: The EU is working hard on getting its own LLMs and other generative models up and running. Silo AI is working to create an open LLM focused on the Nordic languages (Danish, Norwegian, Swedish, Icelandic, and Finnish) based on more specific language data. It is interesting to see Finnish get lumped in with the others: it is a Uralic language, distantly related to Hungarian, rather than a North Germanic language like the rest.


Thanks for reading! Have a great week! 😁
