The problem is that the stock of data commonly used to train language models could be used up in the near future — as early as 2026, according to a paper by researchers at Epoch, an AI research and forecasting organization; the paper has not yet been peer-reviewed. The problem arises because as researchers build more powerful models with greater capabilities, they need to find more and more text to train on. Large-language-model researchers are increasingly concerned that they will run out of this kind of data, says Teven Le Scao, a researcher at the artificial intelligence company Hugging Face, who was not involved in the Epoch work.
Part of the problem is that AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two can be blurry, says Pablo Villalobos, a staff researcher at Epoch and lead author of the paper, but text in the high-quality category is seen as better written and is often produced by professional writers.
Data in the low-quality category consists of text such as social media posts or comments on websites such as 4chan, and it far outnumbers data considered high quality. Researchers typically train models only on data that falls into the high-quality category, because that is the kind of language they want the models to reproduce. This approach has produced impressive results for large language models such as GPT-3.
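To make the filtering idea concrete, here is a minimal sketch of how a corpus might be split by a quality heuristic. The function names, the tiny vocabulary, and the threshold are all hypothetical illustrations; production pipelines typically use trained quality classifiers, not a hand-coded word list like this.

```python
import re

# Toy "formal" vocabulary used only for illustration; real filters are
# usually classifiers trained to distinguish curated text from web noise.
FORMAL_VOCAB = {"the", "research", "model", "data", "language",
                "training", "results", "analysis", "however"}

def quality_score(text: str) -> float:
    """Score a document by the share of its words found in FORMAL_VOCAB.
    Very short snippets score zero, mimicking a length filter."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if len(words) < 5:
        return 0.0
    hits = sum(w in FORMAL_VOCAB for w in words)
    return hits / len(words)

def filter_high_quality(corpus: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only documents whose heuristic score clears the threshold."""
    return [doc for doc in corpus if quality_score(doc) >= threshold]
```

Running this on a mixed corpus would keep a well-formed sentence about research while dropping a low-information comment-thread snippet — the same coarse split the article describes, in miniature.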
One way to overcome these data constraints could be to reassess what counts as "low" and "high" quality, according to Swabha Swayamdipta, a machine-learning professor at the University of Southern California who specializes in dataset quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, that would be a "net positive" for language models, says Swayamdipta.
Researchers could also find ways to extend the life of the data used to train language models. Currently, large language models are typically trained on the same data only once, owing to performance and cost constraints. But it may be possible to train a model several times on the same data, says Swayamdipta.
Some researchers believe that, when it comes to language models, bigger may not be better. Percy Liang, a professor of computer science at Stanford University, says there is evidence that making models more efficient, rather than simply larger, can improve their capabilities.
“We saw how smaller models trained on higher quality data can outperform larger models trained on lower quality data,” he explains.