
When SonarQube Reports "Security: 0": Evaluating AI-Native Code Scanning

TL;DR: I ran OpenAI’s Codex Security against the same codebases our deterministic SAST already watched. SonarQube’s dashboard said “Security: 0.” The AI scanner returned 29 validated findings on one project alone, including committed deployment credentials, stored XSS in payment flows, and a path to privileged CI/CD. The catch: it’s non-deterministic, slow, and token-hungry. It’s a powerful complement to SAST, not a replacement, and the operating model matters more than the tool.

🎯 The Problem: A Blind Spot We Couldn’t Measure

We build and maintain DXP and commerce platforms that handle donations, payments, and PII. Our quality gate — SonarQube — is excellent at code quality and reliability. But it kept reporting zero security vulnerabilities on repositories I had a bad feeling about.

The uncomfortable question wasn’t “are we secure?” It was “how big is the gap between what our tooling sees and what’s actually there?” — and a rule-based scanner can’t answer that, because it doesn’t know what it doesn’t know.

So I spent ~12 hours trialing an AI-native scanner whose entire premise is reasoning across boundaries that pattern-matching rules can’t cross.


🧠 What “AI-Native” Actually Means Here

This is the distinction that matters: Codex Security is AI-native, not AI-assisted. A reasoning LLM does the threat modeling, vulnerability discovery, validation, severity calibration, and remediation proposals. The deep-scan mode fans out to multiple delegated workers: on one run it synthesized 24 independent worker threat models into a single scan context.

Every scan follows a staged pipeline, and crucially, it validates its own findings before reporting them:

flowchart LR
    A["Repo-specific<br/>threat modeling"] --> B["Finding discovery<br/>(source→sink dataflow)"]
    B --> C["Validation<br/>(reproduce + evidence)"]
    C --> D["Attack-path analysis<br/>(severity calibration)"]
    D --> E["Reporting<br/>(CWE, dataflow, fix)"]
    C -.->|rejected / deferred| F["Auditable<br/>'not-an-issue' list"]

That validation stage is what keeps false-positive noise low. Each finding ships with a dataflow narrative, a reachability assessment, a confidence level with rationale, CWE mapping, and concrete remediation. Just as important: it emits an explicit list of rejected and deferred surfaces, so coverage is auditable rather than a black box. Proposed patches are never auto-applied.
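To make the shape of a finding concrete, here is a minimal sketch of a record carrying the fields listed above, plus a helper that separates reported findings from the rejected and deferred list. The class, field, and function names are illustrative assumptions, not Codex Security's actual output schema.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    REPORTED = "reported"
    REJECTED = "rejected"
    DEFERRED = "deferred"


@dataclass(frozen=True)
class Finding:
    # All field names are hypothetical; they mirror the prose, not a real schema.
    title: str
    cwe: str                   # CWE mapping, e.g. "CWE-79"
    dataflow: str              # source -> sink narrative
    reachability: str          # reachability assessment
    confidence: str            # confidence level
    confidence_rationale: str  # why that confidence level was chosen
    remediation: str           # proposed fix (never auto-applied)
    disposition: Disposition = Disposition.REPORTED


def split_report(findings):
    """Separate reported findings from the auditable rejected/deferred list."""
    reported = [f for f in findings if f.disposition is Disposition.REPORTED]
    not_issues = [f for f in findings if f.disposition is not Disposition.REPORTED]
    return reported, not_issues
```

Keeping rejected and deferred surfaces in the same structure as reported findings is what makes the coverage reviewable later: they can be re-examined rather than silently dropped.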


📊 The Result That Ended the Debate

I tested four ways across two codebases. The headline comparison against our SAST baseline:

xychart-beta
    title "Security findings on the same codebase"
    x-axis ["SonarQube (SAST)", "AI scan (standard)", "AI scan (deep)"]
    y-axis "Validated findings" 0 --> 32
    bar [1, 6, 29]

On the project above, the standard AI scan returned 6 findings (1 High, 5 Medium); the deep scan returned 29 (5 High, 22 Medium, 2 Low). That is roughly 4.8× as many findings, including whole families the standard pass missed (GraphQL injection, JSON-LD </script> XSS, additional CMS sink classes, public-API abuse). All 6 standard findings were re-confirmed in the deep scan; none were lost.
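For readers unfamiliar with the JSON-LD `</script>` class: structured data is usually embedded in a `<script type="application/ld+json">` element, and a value containing `</script>` can close that element early. The sketch below shows the generic mitigation of escaping `<` in the serialized JSON. It is an illustration of the issue class, not the code from the scanned project, and `json_ld_script` is a hypothetical helper name.

```python
import json


def json_ld_script(data: dict) -> str:
    """Serialize data for embedding in a JSON-LD script element."""
    payload = json.dumps(data)
    # In valid JSON, "<" can only occur inside string literals, so replacing it
    # with the \u003c escape keeps the JSON equivalent while ensuring the
    # payload can never contain a literal "</script>".
    payload = payload.replace("<", "\\u003c")
    return f'<script type="application/ld+json">{payload}</script>'


html = json_ld_script({"name": "</script><script>alert(1)</script>"})
```

A JSON parser decodes `\u003c` back to `<`, so consumers of the structured data see the original value while the HTML parser never sees a closing tag inside the payload.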

Across both repos, SonarQube reported 1 security vulnerability against 19 validated AI findings (standard-mode basis) — with essentially zero overlap in either direction. That’s the part that mattered: the tools weren’t finding the same things at different sensitivities. Our existing tooling was structurally blind to these issue classes, not merely under-tuned.
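The overlap claim is simple set arithmetic once findings from both tools are reduced to comparable keys. The sketch below uses placeholder keys sized to the counts above and assumes the overlap is exactly zero; in practice a key would be something like a (CWE, file, sink) tuple, and the function name is hypothetical.

```python
def overlap_summary(sast_keys, ai_keys):
    """Count findings seen by both tools and by each tool alone."""
    sast, ai = set(sast_keys), set(ai_keys)
    return {
        "both": len(sast & ai),
        "sast_only": len(sast - ai),
        "ai_only": len(ai - sast),
    }


# Placeholder keys only: 1 SonarQube finding, 19 AI findings, no shared keys.
sast_keys = {"sast-1"}
ai_keys = {f"ai-{n}" for n in range(1, 20)}
summary = overlap_summary(sast_keys, ai_keys)
```

A non-zero `sast_only` count is the reason to keep both tools: each side holds findings the other never produced.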

The findings weren’t theoretical:

| Finding class | Why rule-based SAST missed it |
| --- | --- |
| Committed webhook credential → privileged deployment infra | Requires reasoning from a secret in source across to CI/CD reachability |
| ~10 committed vendor credentials (analytics, CMS, etc.) | Secrets hygiene spanning config + serialization files |
| Stored XSS in donation/payment + CMS rendering | Sanitizer-adequacy judgment, not signature matching |
| SSRF-capable image-optimizer config | Config semantics (e.g. browser-exposed env vars) |

Coverage was broader in a second dimension too: the scan analyzes the full dependency tree (node_modules), not just what’s declared in package.json.
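To see how large that gap can be on a given project, a rough sketch like the following lists packages installed at the top level of node_modules that package.json does not declare. It is not how the scanner works internally, it ignores nested node_modules directories, and the function names are hypothetical.

```python
import json
from pathlib import Path


def installed_packages(node_modules: Path) -> set:
    """Return package names found at the top level of node_modules."""
    names = set()
    for entry in node_modules.iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue  # skip files and helper dirs such as .bin
        if entry.name.startswith("@"):
            # Scoped packages live one level down: @scope/name
            names.update(f"{entry.name}/{p.name}" for p in entry.iterdir() if p.is_dir())
        else:
            names.add(entry.name)
    return names


def undeclared(project: Path) -> set:
    """Installed packages that package.json does not list directly."""
    manifest = json.loads((project / "package.json").read_text())
    declared = set(manifest.get("dependencies", {})) | set(manifest.get("devDependencies", {}))
    return installed_packages(project / "node_modules") - declared
```

Calling `undeclared(Path("."))` from a project root gives a first approximation of the transitive code that a manifest-only review never looks at.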


⚖️ Where It Falls Down

An honest evaluation has to be plain about the weaknesses, because they define the operating model.

Non-determinism is real

Run the same scan on the same code and severities drift:

  • SSRF rated Medium one run, Low the next
  • A leaked secret rated Medium, then High
  • The standard scan cleared two surfaces as “No issue found” that the deep scan reported as findings two days later

Consequence: it cannot be a pass/fail CI gate. A Medium that becomes a Low on re-run breaks any threshold-based pipeline. You have to anchor severity on the dataflow evidence, not the run-specific label.
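One way to anchor on evidence is to identify each finding by its dataflow rather than its label, merge several runs, and flag any finding whose severity changed. The sketch below assumes findings are plain dicts with `cwe`, `source`, `sink`, and `severity` keys (a hypothetical shape), and it adopts "keep the highest severity seen" as the policy, which is a conservative assumption and not something the tool prescribes.

```python
from collections import defaultdict

SEVERITY_RANK = {"Low": 1, "Medium": 2, "High": 3}


def fingerprint(finding):
    """Identify a finding by its evidence, not its run-specific label."""
    return (finding["cwe"], finding["source"], finding["sink"])


def normalize(runs):
    """Merge runs: keep the highest severity per fingerprint and flag drift."""
    labels_seen = defaultdict(set)
    for run in runs:
        for finding in run:
            labels_seen[fingerprint(finding)].add(finding["severity"])
    return {
        key: {
            "severity": max(labels, key=SEVERITY_RANK.__getitem__),
            "drifted": len(labels) > 1,
        }
        for key, labels in labels_seen.items()
    }
```

With this policy, an SSRF finding rated Medium in one run and Low in the next stays Medium and is marked as drifted, so a human reviews the evidence instead of trusting whichever label came last.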

It is not a superset of SAST

Codex missed the one Blocker SonarQube did catch (a client-side code-execution issue). The lesson is not “switch” — it’s layer.

| | Deterministic SAST | AI-native scanning |
| --- | --- | --- |
| Determinism | ✅ Stable, gate-able | ❌ Drifts between runs |
| Cross-boundary dataflow | ❌ Limited | ✅ Strong |
| Config/semantic reasoning | ❌ No | ✅ Yes |
| Code-quality coverage | ✅ Thousands of rules | ❌ None |
| Speed | ✅ Seconds–minutes | ❌ Hours (deep) |
| Validated, narrated findings | ⚠️ Varies | ✅ Per-finding evidence |

It’s slow and token-heavy

A deep scan ran ~3.5 hours wall-clock and was expensive enough that I had to request a capacity increase twice during the trial (~6,000 credits for two deep scans and one soft scan). Findings also frequently carry unproven preconditions — credential liveness, backend controls — flagged honestly as untested rather than asserted, but requiring human follow-up.


🛠️ Operationalizing It: A Two-Tier Cadence

The tool only becomes useful if it maps onto how a team actually works. The deep and soft scan modes map cleanly onto two cadences, with per-PR diff scans covering security-sensitive changes in between:

| Mode | Wall-clock | Role |
| --- | --- | --- |
| Deep scan | ~3.5 h, exhaustive | Mandatory gate for pre-launch and new-build projects; quarterly audits |
| Soft scan | ~30 min, stable core | Monthly sweep on ongoing/managed engagements |
| Diff scan | per-PR | Review of security-sensitive changes (auth, payments, CMS rendering) |
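Deciding which PRs count as security-sensitive can be as simple as checking the changed paths. The sketch below flags a PR when any changed file sits under a directory named in a watch list; the directory names are placeholders that a team would replace with its own layout, and `needs_diff_scan` is a hypothetical helper, not part of any tool.

```python
from pathlib import PurePosixPath

# Hypothetical directory names standing in for auth, payments, and CMS rendering.
SENSITIVE_DIRS = {"auth", "payments", "cms"}


def needs_diff_scan(changed_files, sensitive_dirs=SENSITIVE_DIRS):
    """True if any changed file lives under a security-sensitive directory."""
    for path in changed_files:
        directories = set(PurePosixPath(path).parts[:-1])  # drop the file name
        if directories & sensitive_dirs:
            return True
    return False
```

Matching on whole path components avoids false hits such as a directory called `oauth-docs` triggering on the substring `auth`.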

Beyond cadence, adopting it meant committing to real process changes:

  • Client authorization in SOWs before running AI scanning on client-owned code.
  • A triage rotation + SLA — a deep scan produces 20–30 items that need an owner.
  • A severity-normalization policy anchored on evidence, since labels drift.
  • Credential-rotation runbooks — every scan of legacy code surfaces committed secrets, and the moment a scan report exists, those secrets must be treated as exposed.
  • Keep SonarQube in place as the deterministic quality / CI-gate layer. The tools proved complementary.

⚠️ The report is the most sensitive artifact you now own. A scan report enumerates exploitable paths and quotes committed credentials verbatim. It must never be attached to a general ticket or shared outside the engagement team without redaction. Store reports in access-controlled locations and rotate every credential they name.
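Redaction before sharing can be sketched as a replacement pass over the report text, given the list of secret values the report quotes. The function below is a minimal illustration under that assumption; `redact` is a hypothetical helper, and redacting a report does not remove the need to rotate the credentials.

```python
def redact(report: str, secrets) -> str:
    """Replace each known secret value with a numbered placeholder."""
    # Longest first, so a secret that contains another is replaced whole.
    unique = sorted({s for s in secrets if s}, key=len, reverse=True)
    for index, secret in enumerate(unique, start=1):
        report = report.replace(secret, f"[REDACTED-{index}]")
    return report
```

Numbered placeholders keep the report readable (the same credential gets the same tag everywhere) without revealing the value itself.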


🔐 Privacy & Data Handling

Worth being precise about, since this reasons over client source:

  • What it processes: full source of scanned repos (cloud scans clone temporarily; the plugin reads the local checkout), commit history, and generated artifacts.
  • Isolation: each analysis/validation job runs in an ephemeral, isolated container with session-scoped tools; the clone is temporary and the container is torn down after the job. Transport is TLS-encrypted.
  • Open items before broad rollout: confirm data-handling terms for client code (training exclusion, retention, residency), and gate cloud access behind workspace admin controls.

✅ The Verdict

Strongly favorable — provided it’s layered on top of, not substituted for, existing tooling.

The business case is one prevented incident. The trial alone surfaced a CI/CD takeover path, ~10 credentials needing rotation, and three High-severity XSS classes in payment flows — any one of which, exploited in production, would cost more in incident response than years of tool spend.

But the real shift isn’t the tool — it’s the methodology. AI-native scanning moves security from a point-in-time checklist to an evidence-driven, continuous practice: deep scans as a launch gate, diff scans in the PR flow, “no-issue” results tracked as open surfaces rather than clearances, and secrets discipline enforced because the scanner reliably catches violations.

SonarQube saying “Security: 0” was never reassuring. Now I know exactly how much it wasn’t telling me.