Gemini Leaks Real Phone Numbers

San Francisco: Once in a deployed model, real phone numbers cannot be removed even when owners ask.

Gemini treats scraped contact strings as valid answers to business queries, misdirecting strangers’ calls to unconsenting people, privacy researchers warn.

Google has announced no opt-out mechanism.

Sources: MIT Technology Review

Background

WHY MODELS MEMORIZE

Large language models are trained to predict the next token across billions of documents. Strings that appear even a few times — phone numbers, addresses, API keys — can be encoded directly into the weights as high-probability completions for nearby context. Memorization is not a bug; it is the same mechanism as learning.
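A toy illustration of that mechanism, offered as a sketch rather than a description of how any production model is built: a word-level trigram counter stands in for the network, and the corpus, the name "Dana", and the 555-01xx number are all invented for the example. A string seen only twice still becomes the top completion for its context.

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n=3):
    """Count which token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def complete(counts, prompt, n=3, max_new_tokens=5):
    """Greedy decoding: always append the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        context = tuple(tokens[-(n - 1):])
        if context not in counts:
            break  # context never seen in training, nothing to predict
        tokens.append(counts[context].most_common(1)[0][0])
    return " ".join(tokens)

# Invented corpus: the (fictional) number appears in just two documents.
corpus = [
    "the bakery opens at nine every day",
    "call Dana at 555-0142 for the bakery order",
    "call Dana at 555-0142 about the invoice",
    "the bakery sells bread and the bakery sells cake",
]
counts = train_ngram(corpus)
completion = complete(counts, "call Dana at", max_new_tokens=1)
print(completion)  # call Dana at 555-0142
```

The counter has no notion that the fourth token is personal data. It is simply the likeliest continuation of the context, which is the point the paragraph above makes about learning and memorizing being the same process.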

THE NO-RECALL PROBLEM

Once a number is baked into the weights, it cannot be surgically removed. The weights are billions of floating-point values, and each fact is spread across many of them, entangled with everything else the model knows. There is no index from 'this fact' to 'these parameters', so reliable deletion means retraining from a scrubbed corpus, which can cost tens of millions of dollars per run.

WHY HALLUCINATION FEELS REAL

When asked for a business's phone number, the model does not look anything up. It generates the most statistically plausible 10-digit string given the business name and whatever co-occurred with it in training data. If a real person's number appeared near the business name even once (in a scraped forum post or an old directory), that number can become the confident answer. The model has no concept of 'mine' versus 'someone else's'.
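The failure can be mimicked with a few lines of counting. This is a simplified sketch, not Gemini's actual behavior: the function name `most_plausible_number`, the documents, the business names, and the fictional 555-01xx numbers are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches strings shaped like 415-555-0123 (ten digits, US-style grouping).
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def most_plausible_number(business, documents):
    """Return the phone-like string seen most often alongside `business`."""
    seen = Counter()
    for doc in documents:
        if business.lower() in doc.lower():
            seen.update(PHONE.findall(doc))
    return seen.most_common(1)[0][0] if seen else None

# Invented scrape: the cafe never published a number, but a neighbor did.
documents = [
    "Harbor Cafe has great coffee and slow wifi.",
    "Forum post: I live next to Harbor Cafe, text me at 415-555-0123.",
    "Directory: Elm Street Dental, 415-555-0188.",
]
answer = most_plausible_number("Harbor Cafe", documents)
print(answer)  # 415-555-0123
```

A single forum post is enough. Because no competing number appears near the cafe's name, the neighbor's number wins by default, and nothing in the counting distinguishes the business's line from a bystander's.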

THE GDPR COLLISION

EU privacy law grants a 'right to erasure' (Article 17 of the GDPR): in many circumstances, individuals can demand that an organization delete the personal data it holds about them. LLMs were not designed with this right in mind. Regulators in Italy, France, and Germany have opened cases asking whether training on personal data without a deletion mechanism is lawful at all.

THE SCRAPE-FIRST PRECEDENT

The major foundation models (Gemini, GPT, Claude, Llama) were trained on web crawls that swept up phone numbers, home addresses, and private emails posted anywhere public. Consent was never sought. The industry's working assumption: if it was reachable by a crawler, it was fair game. This assumption is now being tested in courts on three continents.

THE STRUCTURAL FIX

Two approaches are being researched: differential privacy during training (clipping each example's influence and adding calibrated noise, so the trained model reveals almost nothing about whether any single example was included) and machine unlearning (algorithms that approximate the effect of removing data without full retraining). Neither is production-ready at frontier-model scale. For now, the harm runs faster than the remedy.
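For readers who want to see what "adding noise during training" means in practice, here is a minimal sketch of the clip-and-noise step used in differentially private gradient descent. It is plain Python with invented two-dimensional gradients; the function name `dp_average_gradient` and the parameter values are assumptions, and a real system would also track the cumulative privacy budget, which this sketch omits.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm, so no
        # single training example can move the model by more than clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            total[j] += grad[j] * scale
    # Noise is calibrated to the clipping bound, not to the data itself.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # invented per-example gradients

# With the noise switched off, only the clipping is visible:
# the first gradient (norm 5) is shrunk to norm 1, the second is untouched.
clipped_only = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)

# With noise on, the update is perturbed on every call.
noisy = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The trade-off is visible even at this scale: the larger the noise relative to the clipped gradients, the less any one record can be inferred from the result, and the less accurately the model learns. That trade-off is one reason the technique remains difficult to apply at frontier scale.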