When SonarQube Reports "Security: 0": Evaluating AI-Native Code Scanning
TL;DR: I ran OpenAI’s Codex Security against the same codebases our deterministic SAST already watched. SonarQube’s dashboard said “Security: 0.” The AI scanner found 29 validated findings on one project alone — committed deployment credentials, stored XSS in payment flows, and a path to privileged CI/CD. The catch: it’s non-deterministic, slow, and token-hungry. It’s a powerful complement to SAST, not a replacement — and the operating model matters more than the tool.
🎯 The Problem: A Blind Spot We Couldn’t Measure
We build and maintain DXP and commerce platforms that handle donations, payments, and PII. Our quality gate — SonarQube — is excellent at code quality and reliability. But it kept reporting zero security vulnerabilities on repositories I had a bad feeling about.
The uncomfortable question wasn’t “are we secure?” It was “how big is the gap between what our tooling sees and what’s actually there?” — and a rule-based scanner can’t answer that, because it doesn’t know what it doesn’t know.
So I spent ~12 hours trialing an AI-native scanner whose entire premise is reasoning across boundaries that pattern-matching rules can’t cross.
🧠 What “AI-Native” Actually Means Here
This is the distinction that matters: Codex Security is AI-native, not AI-assisted. A reasoning LLM does the threat modeling, vulnerability discovery, validation, severity calibration, and remediation proposals. The deep-scan mode fans out to multiple delegated workers; on one run it synthesized 24 independent worker threat models into a single scan context.
Every scan follows a staged pipeline, and crucially, it validates its own findings before reporting them:
flowchart LR
A["Repo-specific<br/>threat modeling"] --> B["Finding discovery<br/>(source→sink dataflow)"]
B --> C["Validation<br/>(reproduce + evidence)"]
C --> D["Attack-path analysis<br/>(severity calibration)"]
D --> E["Reporting<br/>(CWE, dataflow, fix)"]
C -.->|rejected / deferred| F["Auditable<br/>'not-an-issue' list"]
That validation stage is what keeps false-positive noise low. Each finding ships with a dataflow narrative, a reachability assessment, a confidence level with rationale, CWE mapping, and concrete remediation. Just as important: it emits an explicit list of rejected and deferred surfaces, so coverage is auditable rather than a black box. Proposed patches are never auto-applied.
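To make the shape of that output concrete, here is a minimal Python sketch of how a team might model one finding and the rejected-surface list for its own triage tooling. The class names, field names, and sample values are assumptions for illustration; they are not Codex Security's actual report schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Finding:
    """One validated finding. Field names are hypothetical, not the tool's schema."""
    title: str
    cwe: str          # CWE mapping, e.g. "CWE-79"
    severity: str     # run-specific label: "High", "Medium", or "Low"
    confidence: str   # confidence level reported with the finding
    dataflow: str     # source-to-sink narrative
    remediation: str  # proposed fix; reviewed by a human, never auto-applied


@dataclass
class ScanReport:
    findings: List[Finding] = field(default_factory=list)
    # Surfaces the scan examined and rejected or deferred, kept for auditability.
    rejected_surfaces: List[str] = field(default_factory=list)

    def audited_surface_count(self) -> int:
        """Everything the scan examined: reported findings plus rejected surfaces."""
        return len(self.findings) + len(self.rejected_surfaces)


# Illustrative sample data only.
report = ScanReport(
    findings=[
        Finding(
            title="Stored XSS in CMS rendering",
            cwe="CWE-79",
            severity="High",
            confidence="high",
            dataflow="CMS rich-text field -> unsanitized HTML render",
            remediation="Sanitize rich text before rendering",
        )
    ],
    rejected_surfaces=["Image upload handler (deferred)"],
)
print(report.audited_surface_count())
```

Keeping the rejected surfaces in the same record as the findings is the design point: a later run can be compared against both lists, not just the reported issues.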
📊 The Result That Ended the Debate
I tested four ways across two codebases. The headline comparison against our SAST baseline:
xychart-beta
title "Security findings on the same codebase"
x-axis ["SonarQube (SAST)", "AI scan (standard)", "AI scan (deep)"]
y-axis "Validated findings" 0 --> 32
bar [1, 6, 29]
On the project above, the standard AI scan returned 6 findings (1 High, 5 Medium); the deep scan returned 29 (5 High, 22 Medium, 2 Low). That is roughly 4.8× as many validated findings, including whole families the standard pass missed (GraphQL injection, JSON-LD `</script>` XSS, additional CMS sink classes, public-API abuse). All 6 standard findings were re-confirmed in the deep scan; none were lost.
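The arithmetic behind those totals and the 4.8× figure can be checked in a few lines of Python. The counts are the ones quoted above; the variable names are mine.

```python
# Severity breakdowns as reported for the two scan modes on the same project.
standard = {"High": 1, "Medium": 5}
deep = {"High": 5, "Medium": 22, "Low": 2}

standard_total = sum(standard.values())
deep_total = sum(deep.values())
ratio = deep_total / standard_total

print(f"{standard_total} -> {deep_total} findings ({ratio:.1f}x)")
# prints: 6 -> 29 findings (4.8x)
```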
Across both repos, SonarQube reported 1 security vulnerability against 19 validated AI findings (standard-mode basis) — with essentially zero overlap in either direction. That’s the part that mattered: the tools weren’t finding the same things at different sensitivities. Our existing tooling was structurally blind to these issue classes, not merely under-tuned.
The findings weren’t theoretical:
| Finding class | Why rule-based SAST missed it |
|---|---|
| Committed webhook credential → privileged deployment infra | Requires reasoning from a secret in source across to CI/CD reachability |
| ~10 committed vendor credentials (analytics, CMS, etc.) | Secrets hygiene spanning config + serialization files |
| Stored XSS in donation/payment + CMS rendering | Sanitizer-adequacy judgment, not signature matching |
| SSRF-capable image-optimizer config | Config semantics (e.g. browser-exposed env vars) |
Coverage was broader in a second dimension too: the scan analyzes the full dependency tree (node_modules), not just what’s declared in package.json.
⚖️ Where It Falls Down
An honest evaluation has to lead with the weaknesses, because they define the operating model.
Non-determinism is real
Run the same scan on the same code and severities drift:
- SSRF rated Medium one run, Low the next
- A leaked secret rated Medium, then High
- The standard scan cleared two surfaces as “No issue found” that the deep scan reported as findings two days later
Consequence: it cannot be a pass/fail CI gate. A Medium that becomes a Low on re-run breaks any threshold-based pipeline. You have to anchor severity on the dataflow evidence, not the run-specific label.
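One way to anchor on evidence rather than labels is to key each finding on a fingerprint of its dataflow and then collapse repeated runs. The sketch below is a minimal illustration under my own assumptions: the fingerprint strings, the `normalize_severity` helper, and the "keep the highest severity ever observed" policy are hypothetical choices, not something the scanner provides.

```python
from collections import defaultdict

SEVERITY_RANK = {"Low": 1, "Medium": 2, "High": 3}


def normalize_severity(runs):
    """Collapse repeated runs into one severity per finding.

    `runs` is a list of runs; each run is a list of (fingerprint, severity)
    pairs. The fingerprint identifies the dataflow (source -> sink), not the
    run-specific label. Assumed policy: keep the highest severity observed
    and flag any finding whose label drifted between runs.
    """
    seen = defaultdict(set)
    for run in runs:
        for fingerprint, severity in run:
            seen[fingerprint].add(severity)

    result = {}
    for fingerprint, severities in seen.items():
        worst = max(severities, key=SEVERITY_RANK.__getitem__)
        result[fingerprint] = {"severity": worst, "drifted": len(severities) > 1}
    return result


# Two runs over the same code, mirroring the drift described above.
runs = [
    [("ssrf:image-optimizer", "Medium"), ("secret:webhook", "Medium")],
    [("ssrf:image-optimizer", "Low"), ("secret:webhook", "High")],
]
normalized = normalize_severity(runs)
print(normalized["ssrf:image-optimizer"])
print(normalized["secret:webhook"])
```

Taking the maximum is deliberately conservative. A team could equally require a human decision whenever `drifted` is true; the point is that the pipeline never trusts a single run's label.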
It is not a superset of SAST
Codex missed the one Blocker SonarQube did catch (a client-side code-execution issue). The lesson is not “switch” — it’s layer.
| | Deterministic SAST | AI-native scanning |
|---|---|---|
| Determinism | ✅ Stable, gate-able | ❌ Drifts between runs |
| Cross-boundary dataflow | ❌ Limited | ✅ Strong |
| Config/semantic reasoning | ❌ No | ✅ Yes |
| Code-quality coverage | ✅ Thousands of rules | ❌ None |
| Speed | ✅ Seconds–minutes | ❌ Hours (deep) |
| Validated, narrated findings | ⚠️ Varies | ✅ Per-finding evidence |
It’s slow and token-heavy
A deep scan ran ~3.5 hours wall-clock and was expensive enough that I had to request a capacity increase twice during the trial (~6,000 credits for two deep scans and one soft scan). Findings also frequently carry unproven preconditions — credential liveness, backend controls — flagged honestly as untested rather than asserted, but requiring human follow-up.
🛠️ Operationalizing It: A Two-Tier Cadence
The tool only becomes useful if it maps onto how a team actually works. The deep and soft scans map cleanly onto two cadences, with per-PR diff scans covering security-sensitive changes in between:
| Mode | Wall-clock | Role |
|---|---|---|
| Deep scan | ~3.5 h, exhaustive | Mandatory launch gate (human-triaged, not an automated CI threshold) for pre-launch and new-build projects; quarterly audits |
| Soft scan | ~30 min, stable core | Monthly sweep on ongoing/managed engagements |
| Diff scan | per-PR | Review of security-sensitive changes (auth, payments, CMS rendering) |
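The table above can be read as a small routing rule. Here is a hedged Python sketch of that rule; the trigger names, the `pick_scan_mode` function, and the path markers used to detect security-sensitive changes are all assumptions for illustration, not part of any tool.

```python
from typing import Optional, Sequence

# Hypothetical substrings marking security-sensitive paths (auth, payments, CMS rendering).
SENSITIVE_MARKERS = ("auth", "payment", "cms")


def pick_scan_mode(trigger: str, changed_paths: Sequence[str] = ()) -> Optional[str]:
    """Map a project event to a scan mode, following the cadence table."""
    if trigger in ("pre_launch", "new_build", "quarterly_audit"):
        return "deep"
    if trigger == "monthly_sweep":
        return "soft"
    if trigger == "pull_request":
        # Only PRs touching a sensitive area get a diff scan.
        touched = any(
            marker in path.lower()
            for path in changed_paths
            for marker in SENSITIVE_MARKERS
        )
        return "diff" if touched else None
    raise ValueError(f"unknown trigger: {trigger}")


print(pick_scan_mode("pre_launch"))
print(pick_scan_mode("pull_request", ["src/payments/checkout.ts"]))
print(pick_scan_mode("pull_request", ["README.md"]))
```

Substring matching on paths is crude and would need tuning per repository; a maintained list of sensitive directories is probably the more robust choice.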
Beyond cadence, adopting it meant committing to real process changes:
- Client authorization in SOWs before running AI scanning on client-owned code.
- A triage rotation + SLA — a deep scan produces 20–30 items that need an owner.
- A severity-normalization policy anchored on evidence, since labels drift.
- Credential-rotation runbooks — every scan of legacy code surfaces committed secrets, and the moment a scan report exists, those secrets must be treated as exposed.
- Keep SonarQube in place as the deterministic quality / CI-gate layer. The tools proved complementary.
⚠️ The report is the most sensitive artifact you now own. A scan report enumerates exploitable paths and quotes committed credentials verbatim. It must never be attached to a general ticket or shared outside the engagement team without redaction. Store reports in access-controlled locations and rotate every credential they name.
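As a minimal sketch of the redaction step, assuming the triage owner already has the list of secret values the report quotes, a helper like the one below replaces each value with a numbered placeholder before the text is shared. The `redact` function and the sample webhook URL are hypothetical, and redaction does not replace rotation: the credentials still have to be rotated.

```python
def redact(report_text, secrets):
    """Replace each known secret value with a numbered placeholder.

    Longest secrets are replaced first so a secret that contains another
    is removed whole; ties are broken alphabetically so numbering is stable.
    """
    ordered = sorted(set(secrets), key=lambda s: (-len(s), s))
    for index, secret in enumerate(ordered, start=1):
        report_text = report_text.replace(secret, f"[REDACTED-{index}]")
    return report_text


# Made-up sample; the URL is not a real credential.
sample = "Webhook URL https://hooks.example.invalid/T000/abc123 found in deploy.yml"
redacted = redact(sample, ["https://hooks.example.invalid/T000/abc123"])
print(redacted)
# prints: Webhook URL [REDACTED-1] found in deploy.yml
```

This only removes values it is told about. Anything the list misses stays in the text, so the access controls on the original report still matter.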
🔐 Privacy & Data Handling
Worth being precise about, since this reasons over client source:
- What it processes: full source of scanned repos (cloud scans clone temporarily; the plugin reads the local checkout), commit history, and generated artifacts.
- Isolation: each analysis/validation job runs in an ephemeral, isolated container with session-scoped tools; the clone is temporary and the container is torn down after the job. Transport is TLS-encrypted.
- Open items before broad rollout: confirm data-handling terms for client code (training exclusion, retention, residency), and gate cloud access behind workspace admin controls.
✅ The Verdict
Strongly favorable — provided it’s layered on top of, not substituted for, existing tooling.
The business case is one prevented incident. The trial alone surfaced a CI/CD takeover path, ~10 credentials needing rotation, and three High-severity XSS classes in payment flows — any one of which, exploited in production, would cost more in incident response than years of tool spend.
But the real shift isn’t the tool — it’s the methodology. AI-native scanning moves security from a point-in-time checklist to an evidence-driven, continuous practice: deep scans as a launch gate, diff scans in the PR flow, “no-issue” results tracked as open surfaces rather than clearances, and secrets discipline enforced because the scanner reliably catches violations.
SonarQube saying “Security: 0” was never reassuring. Now I know exactly how much it wasn’t telling me.