THE FUZZING TRADITION
Fuzzing, the practice of bombarding a program with malformed inputs to trigger crashes, dates to the late 1980s and has been a staple of browser testing for decades. Mozilla's jsfunfuzz, released by Jesse Ruderman in 2007, found over 1,700 JavaScript engine bugs before AI was anywhere near the problem. Crash harnesses are the existing rails an LLM bug-hunter rides on: the model proposes inputs, and the harness executes them in instrumented builds and reports what broke.
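The propose-execute-report loop can be sketched in a few lines of Python. This is a toy illustration, not Mozilla's harness or jsfunfuzz: toy_parser, mutate, and run_harness are hypothetical names, the target is a deliberately buggy stand-in rather than a browser build, and a random byte mutator stands in for the step where an LLM would propose inputs.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CrashReport:
    input_data: bytes
    error_type: str
    message: str


def toy_parser(data: bytes) -> int:
    """Hypothetical code under test; stands in for an instrumented build."""
    if len(data) < 2:
        raise ValueError("input too short")  # graceful rejection, not a crash
    declared_len = data[0]
    payload = data[1:]
    if declared_len == 0:
        return 0
    # Deliberate bug: trusts the declared length without checking it.
    return payload[declared_len - 1]


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one random byte. An LLM-driven proposer would replace this."""
    buf = bytearray(seed)
    pos = rng.randrange(len(buf))
    buf[pos] = rng.randrange(256)
    return bytes(buf)


def run_harness(
    target: Callable[[bytes], int],
    seed: bytes,
    iterations: int,
    rng: random.Random,
) -> List[CrashReport]:
    """Feed mutated inputs to the target and record every unexpected failure."""
    crashes: List[CrashReport] = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except ValueError:
            continue  # the target rejected the input cleanly
        except Exception as exc:  # anything else counts as a "crash" here
            crashes.append(CrashReport(candidate, type(exc).__name__, str(exc)))
    return crashes


if __name__ == "__main__":
    found = run_harness(toy_parser, bytes([3, 10, 20, 30]), 200, random.Random(0))
    print(f"{len(found)} crashing inputs found")
```

In a real pipeline the target would run in a separate process under a sanitizer, so that a hard crash cannot take the harness down with it.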
WHY FALSE POSITIVES KILL SECURITY TOOLS
Static analyzers and old-school pattern scanners have often produced false-positive rates above 90%. Each bogus report costs an engineer 15 to 60 minutes of triage, so a noisy tool quickly trains its users to ignore it. Coverity's commercial breakthrough in the mid-2000s was not finding more bugs but lowering the false-positive rate enough that engineers stopped ignoring the queue. The 271-bug, near-zero-false-positive claim is exactly this metric, restated for the LLM era.
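The arithmetic behind that triage burden is simple enough to write down. The sketch below is illustrative only: wasted_triage_hours is a hypothetical helper, and the 1% rate is an assumed stand-in for "near zero", not a figure from the claim itself.

```python
def wasted_triage_hours(
    reports: int, false_positive_rate: float, minutes_per_report: float
) -> float:
    """Engineer-hours spent triaging reports that turn out to be bogus."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return reports * false_positive_rate * minutes_per_report / 60.0


if __name__ == "__main__":
    for rate in (0.90, 0.01):  # 0.01 is an assumed value for "near zero"
        low = wasted_triage_hours(271, rate, 15)
        high = wasted_triage_hours(271, rate, 60)
        print(f"FP rate {rate:.0%}: {low:.1f} to {high:.1f} hours wasted")
```

For a queue of 271 reports, a 90% false-positive rate wastes roughly 61 to 244 engineer-hours, while a 1% rate wastes under 3.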
LLM-AS-JUDGE
Using a second language model to grade the first model's output is a well-established technique. It works because verification is often easier than generation. For a crash report, the judge model checks whether the proof-of-concept actually reproduces, whether the stack trace matches the claimed bug class, and whether the report duplicates an existing one. The idea is a close relative of, though not identical to, the learned reward models in OpenAI's RLHF work and the AI-generated feedback in Anthropic's Constitutional AI: in each case one model scores another model's output.
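The three checks can be made concrete with a rule-based stand-in for the judge. Everything here is an assumption for illustration: Finding, EXPECTED_MARKERS, and judge are hypothetical names, a real judge would be a model call rather than string matching, and deduplicating by top stack frame is a crude heuristic. The marker strings follow the error names AddressSanitizer prints in its reports.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Finding:
    claimed_class: str  # bug class asserted by the first model
    stack_trace: str    # sanitizer output captured by the harness
    reproduced: bool    # did re-running the proof-of-concept crash again?
    top_frame: str      # crashing function, used as a crude dedup key


# Assumed mapping from claimed bug class to text expected in the trace.
EXPECTED_MARKERS = {
    "use-after-free": "heap-use-after-free",
    "out-of-bounds read": "heap-buffer-overflow",
}


def judge(finding: Finding, seen_frames: Set[str]) -> List[str]:
    """Return the reasons to reject a finding; an empty list means accept."""
    reasons: List[str] = []
    if not finding.reproduced:
        reasons.append("proof-of-concept did not reproduce")
    marker = EXPECTED_MARKERS.get(finding.claimed_class)
    if marker is None or marker not in finding.stack_trace:
        reasons.append("stack trace does not match claimed bug class")
    if finding.top_frame in seen_frames:
        reasons.append("duplicate of an existing report")
    return reasons


if __name__ == "__main__":
    known = {"Parser::ReadToken"}  # hypothetical frame from an earlier report
    candidate = Finding(
        claimed_class="use-after-free",
        stack_trace="ERROR: AddressSanitizer: heap-use-after-free on address ...",
        reproduced=True,
        top_frame="Layout::Reflow",  # hypothetical frame name
    )
    print(judge(candidate, known) or "accepted")
```

Returning the list of rejection reasons, rather than a bare pass or fail, keeps the judge's verdict auditable by the humans who triage the queue.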
THE MEMORY-SAFETY SUBSTRATE
Roughly 70% of high-severity security bugs in Chromium, by the project's own count, are memory-safety bugs in C and C++ code: use-after-free, out-of-bounds reads, type confusion. Comparable figures have been reported for other large C and C++ codebases, and Firefox and Safari are built on the same kind of foundation. This is why Mozilla has been migrating Firefox internals to Rust since 2016 and why Google launched its own Rust-in-Chromium effort. AI bug-finders are mining a vein that exists because the underlying language is unsafe.
THE DISCLOSURE QUESTION
Responsible disclosure norms give vendors a fixed window to patch before details go public: CERT/CC's policy defaults to 45 days, and the 90-day deadline now widely treated as standard was popularized by Google's Project Zero. Publishing 12 of 271 reports is not a violation of those norms; it is a marketing choice. The substantive worry critics raise is selection bias: 12 cherry-picked wins do not let outsiders estimate the true precision of the pipeline.
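The selection-bias point can be put in numbers. The sketch below contrasts two readings of the same 12 valid reports, under stated assumptions: the function names are hypothetical, and the random-sample bound uses the standard zero-failure binomial formula (alpha to the power 1/n), which ignores the finite pool of 271 and is therefore slightly conservative.

```python
def worst_case_precision(published_valid: int, total_reports: int) -> float:
    """Precision outsiders can confirm if the published reports were hand-picked.

    Nothing is known about the unpublished reports, so only the published
    ones can be counted as true positives.
    """
    if not 0 <= published_valid <= total_reports or total_reports == 0:
        raise ValueError("need 0 <= published_valid <= total_reports, total > 0")
    return published_valid / total_reports


def random_sample_lower_bound(sample_size: int, alpha: float = 0.05) -> float:
    """One-sided lower confidence bound on precision when every report in a
    uniformly random sample turned out valid (binomial approximation)."""
    if sample_size <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("sample_size must be positive and alpha in (0, 1)")
    return alpha ** (1.0 / sample_size)


if __name__ == "__main__":
    print(f"cherry-picked: {worst_case_precision(12, 271):.1%} confirmed")
    print(f"random sample: at least {random_sample_lower_bound(12):.1%} (95% bound)")
```

If the 12 published reports were hand-picked, the only precision outsiders can confirm is 12/271, about 4.4%. Had the same 12 been drawn at random and all checked out, the 95% lower bound would be about 78%.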