THE MECHANISM
Large language models are trained to predict the most probable next token. Averaged across billions of examples, the most probable choice is by definition the most common one. The model's natural attractor is the median of its training data — not the mean of human writing, but the mode.
THE SHARED CORPUS
The major frontier labs train on overlapping slices of the same internet — Common Crawl, Wikipedia, Reddit, Stack Exchange, books from the same shadow libraries. Different architectures fed the same diet converge on similar outputs. The 'house style' is the internet's center of gravity, not any one company's voice.
RLHF NARROWS FURTHER
After pretraining, models are tuned with reinforcement learning from human feedback. A few thousand contractors rate outputs; the model learns what those raters reward. Polite hedging, balanced both-sides framing, the em-dash transition, the 'It's not just X — it's Y' construction: these are not invented by the model, they are amplified by the rating pool's preferences.
THE TELLS
Audiences trained to spot AI now flag specific tics: em-dashes between independent clauses, tricolon openers, 'delve' and 'tapestry,' the antithesis pivot ('not merely X, but Y'), reflexive caveats. None of these are AI inventions — they are register markers of formal English the model overuses.
THE FEEDBACK LOOP
AI output is now a meaningful share of new text on the web. Next-generation models train on a corpus partly written by previous-generation models. Researchers call this model collapse — the variance of the training distribution shrinks with each cycle, and the median tightens around itself.
THE HUMAN COST
When a style becomes the AI style, humans who naturally write that way get accused of using AI. Non-native English writers, who learned formal register from textbooks, get flagged more often than native speakers writing casually. AI-detection tools have documented false-positive rates above 50% on non-native prose.