Mythos Flags 271 Firefox Bugs

San Francisco: 2 months, 271 bugs, almost no false positives.

The key is a custom harness that gives Anthropic’s Mythos access to Firefox’s crash-test pipeline, plus a second language model that grades each report.

Mozilla published 12 of the 271 bug reports; critics question whether cherry-picking conceals less accurate results.

Sources: Ars Technica

Background

THE FUZZING TRADITION

Browsers have been fuzzed — bombarded with malformed inputs to trigger crashes — since the 1990s. Mozilla's jsfunfuzz, written by Jesse Ruderman in 2007, found over 1,700 JavaScript engine bugs before AI was anywhere near the problem. Crash harnesses are the existing rails an LLM bug-hunter rides on; the model proposes inputs, the harness executes them in instrumented builds and reports what broke.
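The propose-execute-report loop described above can be sketched in a few lines. This is a toy under stated assumptions, not Mozilla's or Anthropic's harness: `toy_parser` is a hypothetical target with a planted bug, random byte mutation stands in for whatever proposes inputs, and a Python exception stands in for a real crash in an instrumented build.

```python
import random


def toy_parser(data: bytes) -> int:
    """Hypothetical target with a planted bug: a 0xFF byte at an even offset 'crashes' it."""
    total = 0
    for i, b in enumerate(data):
        if b == 0xFF and i % 2 == 0:
            raise ValueError(f"parser crash at offset {i}")
        total += b
    return total


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte of the seed with a random value."""
    buf = bytearray(seed)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(target, seed: bytes, iterations: int, rng: random.Random):
    """Feed mutated inputs to the target and collect every input that makes it fail."""
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except Exception as exc:  # the harness's job: notice what broke and record it
            crashes.append((candidate, repr(exc)))
    return crashes


crashes = fuzz(toy_parser, b"\x00" * 8, 5000, random.Random(0))
print(len(crashes), "crashing inputs found")
```

In an LLM-driven setup, the assumption is that the model replaces `mutate` as the source of candidate inputs while the execute-and-record half stays the same.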

WHY FALSE POSITIVES KILL SECURITY TOOLS

Static analyzers and old-school pattern scanners routinely produced 90%+ false-positive rates. Each bogus report costs an engineer 15-60 minutes of triage. Coverity's commercial breakthrough in the mid-2000s was not finding more bugs but lowering the false-positive rate enough that engineers stopped ignoring the queue. The 271-bug, near-zero-false-positive claim is exactly this metric, restated for the LLM era.
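To see why the rate matters more than the raw count, the arithmetic can be run directly. The sketch below assumes, purely for illustration, that a tool must surface 271 real bugs and uses the 15-60 minute triage range quoted above; the function names are made up for this example.

```python
def total_reports(true_bugs: int, fp_rate: float) -> float:
    """Reports a tool must emit to surface `true_bugs` real bugs at a given false-positive rate."""
    return true_bugs / (1.0 - fp_rate)


def wasted_hours(true_bugs: int, fp_rate: float, minutes_per_report: float) -> float:
    """Engineer-hours spent triaging the bogus reports alone."""
    bogus = total_reports(true_bugs, fp_rate) - true_bugs
    return bogus * minutes_per_report / 60.0


for fp_rate in (0.90, 0.50, 0.05):
    low = wasted_hours(271, fp_rate, 15)
    high = wasted_hours(271, fp_rate, 60)
    print(f"FP rate {fp_rate:.0%}: {low:,.0f} to {high:,.0f} engineer-hours on bogus reports")
```

At a 90% false-positive rate, 271 real bugs arrive buried in roughly 2,710 reports, of which about 2,439 are noise.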

LLM-AS-JUDGE

Using a second language model to grade the first model's output is a well-established technique. It works because verification is often easier than generation. For a crash report, the judge model checks whether the proof-of-concept actually reproduces, whether the stack trace matches the claimed bug class, and whether the report duplicates an existing one. The idea is a close relative of the reward models OpenAI trained for RLHF and of the AI-feedback step in Anthropic's Constitutional AI: in each case a model, not a human, scores another model's output.
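The three checks can be laid out as a small triage function. This is a sketch only: in the pipeline described, a language model would make these calls, whereas the rule-based `judge` below is a deterministic stand-in. The `Report` fields, the `EXPECTED_MARKER` table, and the frame names are all hypothetical; the marker strings follow the error names AddressSanitizer prints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    claimed_class: str      # bug class the first model claims, e.g. "use-after-free"
    stack_top: str          # top frame of the observed crash
    sanitizer_summary: str  # first line of the sanitizer output
    reproduced: bool        # did re-running the proof-of-concept crash again?


# Hypothetical mapping from claimed bug class to the marker expected in the trace.
EXPECTED_MARKER = {
    "use-after-free": "heap-use-after-free",
    "out-of-bounds read": "heap-buffer-overflow",
}


def judge(report: Report, seen_signatures: set) -> str:
    """Apply the three checks in order: reproduces, matches claimed class, not a duplicate."""
    if not report.reproduced:
        return "reject: does not reproduce"
    marker = EXPECTED_MARKER.get(report.claimed_class)
    if marker is None or marker not in report.sanitizer_summary:
        return "reject: trace does not match claimed class"
    signature = (report.claimed_class, report.stack_top)
    if signature in seen_signatures:
        return "reject: duplicate"
    seen_signatures.add(signature)
    return "accept"


seen = set()
uaf = "ERROR: AddressSanitizer: heap-use-after-free"
reports = [
    Report("use-after-free", "Widget::Release", uaf, True),
    Report("use-after-free", "Widget::Release", uaf, True),   # same bug, reported twice
    Report("out-of-bounds read", "Parser::Next", uaf, True),  # claim and trace disagree
    Report("use-after-free", "Cache::Evict", uaf, False),     # proof-of-concept fails
]
verdicts = [judge(r, seen) for r in reports]
print(verdicts)
```

Ordering the checks from cheapest and most objective (reproduction) to most contextual (deduplication) is a design choice of this sketch, not something the source describes.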

THE MEMORY-SAFETY SUBSTRATE

Roughly 70% of severe browser vulnerabilities — across Firefox, Chrome, and Safari — are memory-safety bugs in C and C++ code: use-after-free, out-of-bounds reads, type confusion. This is why Mozilla has been migrating Firefox internals to Rust since 2016 and why Google launched its own Rust-in-Chromium effort. AI bug-finders are mining a vein that exists because the underlying language is unsafe.

THE DISCLOSURE QUESTION

Responsible disclosure norms give vendors a fixed window to patch before details go public: 45 days under CERT/CC's long-standing policy, 90 days under the deadline Google's Project Zero popularized. Publishing 12 of 271 reports is not a violation of those norms; it is a marketing choice. The substantive worry critics raise is selection bias: 12 cherry-picked wins do not let outsiders estimate the true precision of the pipeline.
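The selection-bias point can be made concrete with two small calculations. Both assume, for illustration, that all 12 published reports are valid. The first asks what 12 hand-picked wins prove; the second asks what 12 wins would prove had they been drawn uniformly at random, using the one-sided exact binomial (Clopper-Pearson) bound for an all-correct sample. Function names are invented for this sketch.

```python
def floor_from_picked(picked: int, total: int) -> float:
    """If the publisher chooses which reports to show, `picked` valid reports
    only prove that at least `picked` of `total` are valid."""
    return picked / total


def lower_bound_random_sample(sample_size: int, confidence: float = 0.95) -> float:
    """One-sided exact lower bound on precision when every report in a
    uniformly random sample turns out valid: solve p**n = 1 - confidence."""
    return (1.0 - confidence) ** (1.0 / sample_size)


print(f"cherry-picked 12 of 271: precision >= {floor_from_picked(12, 271):.1%}")
print(f"random 12, all valid:    precision >= {lower_bound_random_sample(12):.1%} (95% confidence)")
```

The same 12 reports support a precision floor of only about 4% when hand-picked, but close to 78% when randomly sampled, which is why critics care about how the 12 were chosen.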