We've passed 'peak LLM'
THE VIRGIN CORPUS AND THE CASE FOR SPECIALIZED AI
We may have passed “peak LLM” and not even noticed. Each conversation with the bots feels like a retread; they are all converging on the same nebulous space where they all sound alike. I see posts asking, “Have you noticed how generic the answers on ChatGPT have gotten?” or pointing at the homogenized output of the visual models. Things may feel like they are getting smoother, but those rough edges may have been what was best about the technology, and this “smoothing down” may be a klaxon for something more insidious.
“Hey Steve… I like pizza” -Multiplicity or what happens when you make a copy of a copy of a copy

What we are seeing is model collapse. The research is out there, and it is plain: training models on synthetic data degrades their performance. Why are the frontier labs training on synthetic data at all? Because most of them have already trained on just about everything they can get their hands on. Building a functional, “reasoning” LLM takes a massive amount of information: it has to see all the words, how they have been written, and the context around them from every source. Once it has that, it can Frankenstein words and phrases together using training and luck. And once we start feeding output generated from training and luck back in as training data, we see: “results show that even the smallest fraction of synthetic data (e.g., as little as 1 per 1000) can still lead to model collapse: larger and larger training sets do not enhance performance.” (https://openreview.net/forum?id=et5l9qPUhm) Or: “We discover that indiscriminately learning from data produced by other models causes ‘model collapse’—a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time.” (https://www.nature.com/articles/s41586-024-07566-y)
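That degenerative loop can be sketched with a toy statistics example: “train” a model by fitting a distribution to data, then train the next generation only on samples drawn from that fit. This is a minimal illustration of the dynamic the Nature paper describes, with a Gaussian standing in for a language model; all numbers are illustrative, not from the papers.

```python
import random
import statistics

def fit_gaussian(samples):
    """'Train' a model: estimate the mean and std dev of the data."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def collapse_run(n_samples=10, n_generations=300, seed=42):
    """Each generation trains only on synthetic samples drawn from the
    previous generation's fit, never on the original data."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # real "human" data
    sigmas = []
    for _ in range(n_generations):
        mu, sigma = fit_gaussian(data)
        sigmas.append(sigma)
        # next generation's training set is purely synthetic output
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return sigmas

sigmas = collapse_run()
print(f"generation 0 spread: {sigmas[0]:.4f}")
print(f"final generation spread: {sigmas[-1]:.3e}")
```

Each generation loses a little of the original distribution’s tails, and the spread ratchets toward zero; nothing in the loop can recover variance that an earlier fit threw away.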
It’s not like this is a new thing; we’ve known about the data bottleneck for the last three or four years. In 2022, DeepMind’s Jordan Hoffmann and team worked out how much training data a model needs relative to its size, the “Chinchilla scaling laws.” Their rule of thumb was that each model parameter needs about 20 tokens of training data; a 70B-parameter model, for example, needs roughly 1.4 trillion tokens to be properly trained. By the same math, the foundation models with hundreds of billions of parameters require more quality text than humanity has readily available. This is the pickle: for a large model, there is a hard limit on the amount of fresh, real, new human data it can be trained on. That’s the “ceiling,” and the collapse driven by synthetic data is coming from below. Powerful forces are closing in on your favorite LLM from both directions.
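The arithmetic behind that rule of thumb is simple enough to check directly. A quick sketch using the 20-tokens-per-parameter ratio from the text above (the function name is my own):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens under the ~20-tokens-per-parameter
    rule of thumb from the Chinchilla scaling laws (Hoffmann et al., 2022)."""
    return n_params * tokens_per_param

for name, params in [("70B", 70e9), ("300B", 300e9), ("671B", 671e9)]:
    print(f"{name} parameters -> {chinchilla_tokens(params) / 1e12:.2f}T tokens")
    # 70B -> 1.40T, 300B -> 6.00T, 671B -> 13.42T
```

At 671B parameters, the compute-optimal budget is roughly 13.4 trillion tokens, which is the wall the rest of this piece runs into.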
What if there were a way to get ahead of this by going backwards? Think of it as a Svalbard seed vault for language: build a model on pre-slop internet content, plus whatever other human-created material we can ethically source and verify with C2PA provenance and cryptographic signing, then lock that off as a “Virgin Corpus.” New training data could still be added from authenticated, human-derived works as they are created, but without chasing scale, scale, scale, because the models aren’t going to get much better at language beyond a certain point. This doesn’t obviate the need for AI systems that use different methodologies to achieve amazing things for humanity: coding, logic, extended thinking and context windows, medical imaging, math, and science. All of those can be dedicated sub-systems that don’t rely on ever-larger models.
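To make the “locking off” concrete, here is a minimal sketch of the admission rule such a corpus could enforce: a document enters only if its provenance signature verifies. A real pipeline would validate C2PA manifests against publisher public keys; the HMAC below is just a stand-in, and the registry key is hypothetical.

```python
import hashlib
import hmac

# Hypothetical signing key held by a provenance registry, NOT a real C2PA flow.
SECRET = b"registry-key"

def sign(document: bytes) -> str:
    """Produce a provenance signature for a human-attested document."""
    return hmac.new(SECRET, document, hashlib.sha256).hexdigest()

def admit(corpus: list, document: bytes, signature: str) -> bool:
    """Append to the locked corpus only if the signature checks out."""
    if hmac.compare_digest(sign(document), signature):
        corpus.append(document)
        return True
    return False

corpus = []
doc = b"A human-written paragraph from 2012."
ok = admit(corpus, doc, sign(doc))          # valid provenance -> admitted
bad = admit(corpus, b"synthetic slop", "deadbeef")  # bad signature -> rejected
print(ok, bad, len(corpus))  # True False 1
```

The point is the gate, not the crypto: unauthenticated text, however plausible, never touches the training set.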
DeepSeek-V3, Mixtral, and Llama 4 all use something called Mixture of Experts, or MoE. The idea is to activate only a subset of the model for each inference, routing the work to a “specialist.” Using a fraction of a large model takes less compute, which means less power, less cost, and possibly a faster result. This is the right direction, but it still sits on top of a large model: DeepSeek-V3 has 671B parameters, which by the Chinchilla scaling laws would need roughly 13.4 trillion tokens to properly train, far more than the human content it has access to. So MoE is a solution, right now, for power, specialty, and speed, but all of the specialists live inside that same model, and that model is what it is; there’s no real way to grow it without resorting to synthetic data… model collapse. Bigger isn’t the answer.
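The routing idea behind MoE can be sketched in a few lines: a gate scores every expert, but only the top-k actually execute. This toy version is plain Python with made-up gate weights and scalar “experts” standing in for feed-forward blocks; it is only meant to show where the compute savings come from.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=2):
    """Sparse MoE layer: score all experts with the gate, run only the
    top_k, and mix their outputs by renormalized gate probability.
    The non-selected experts cost nothing at inference time."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: -probs[i])[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)  # only these experts actually run
        for j, yj in enumerate(y):
            out[j] += (probs[i] / norm) * yj
    return out, chosen

# 8 toy "experts"; each just scales its input by a different factor
experts = [lambda x, s=s: [s * xi for xi in x] for s in range(1, 9)]
gate = [[0.1 * (i - 4), 0.2] for i in range(8)]  # hypothetical gate weights
out, used = moe_forward([1.0, -0.5], gate, experts)
print(f"ran experts {used} out of {len(experts)}")
```

In a real MoE layer the experts are full feed-forward networks and the gate is learned, but the shape of the trick is the same: all the parameters exist, only a fraction of them run per token.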
We should collectively be looking at specialization at the model level. Think of it like C-3PO and R2-D2. Taking the principles of MoE, there’s a language model on top, the C-3PO: big and generalized, able to comprehend, direct traffic, and relay answers coming back from the sub-systems, the R2-D2s. This is where we move to a new framework of smaller, specialized models. They can start from the Virgin Corpus, then train further on image detection, molecular biology, law, or any number of specialty domains, each refined and grown in its own space without needing a constantly expanding, trillion-parameter general model built on synthetic data. This has benefits for power and compute, as well as a smaller RAM footprint than holding one enormous model in memory.
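A hypothetical sketch of that split: a generalist front end (the C-3PO) routes each request to a small domain specialist (an R2-D2) or answers directly. The keyword routing and model names below are invented for illustration; a real system would use the language model itself to classify the request.

```python
# Illustrative specialists: each stands in for a small, separately trained model.
SPECIALISTS = {
    "law":     lambda q: f"[law model] analyzing: {q}",
    "biology": lambda q: f"[molecular-biology model] analyzing: {q}",
    "imaging": lambda q: f"[image-detection model] analyzing: {q}",
}

# Crude routing table; a real router would be the generalist LLM itself.
KEYWORDS = {
    "law":     {"contract", "statute", "liability"},
    "biology": {"protein", "enzyme", "genome"},
    "imaging": {"image", "scan", "x-ray"},
}

def route(query: str) -> str:
    """The 'C-3PO' layer: pick the specialist with the largest keyword
    overlap, or answer directly when no specialist matches."""
    words = set(query.lower().split())
    best = max(KEYWORDS, key=lambda d: len(KEYWORDS[d] & words))
    if not KEYWORDS[best] & words:
        return f"[generalist] answering directly: {query}"
    return SPECIALISTS[best](query)

print(route("does this contract create liability"))
print(route("scan this x-ray image"))
```

Each specialist can be retrained or replaced on its own schedule, on its own vetted data, without touching the generalist or the other experts.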
The money right now is going into scale. Scale is finite, and there isn’t a happy ending there. We need to think critically about diverting our investments into a smarter strategy. There is so much money tied up in scale, but “Step 2. Scale” won’t give us a “Step 3. Profit.” A smarter architecture, built from specialized systems uncontaminated by slop and trained extensively on their own specialties, would make this something that travels into the future. Instead of a future of underperforming, expensive chatbots, specialized systems can keep getting better. AI can help cure diseases, rapidly develop vaccines, study climate, or do high-level mathematical theory, specializing in nearly any chosen field at substantially less cost.
I’m a VFX supervisor. I’m not anti-AI, but I can see the leaks in the dam from the user level, and when I research why, the data is plain. I’m sure a lot is going on out there in AI development, inside the frontier labs and even the smaller labs working on bespoke models. But the current plan of “hyper-scaling ’til you make it” is fraught with pitfalls and dead ends. As I said, we may have passed peak LLM and, in our excitement, not even noticed. Now is the time to examine other methods of searching for answers, ones that don’t hyper-scale new models on synthetic data. Let’s change the world…
From NotebookLM, a podcast about this.