skillscan.sh
scoreboard · corpus v1.1 · directional · 2026-06-17
[!]
DRAFT / PREVIEW — directional, small-n (real-wild malicious n=5, all one archetype). Not a certification. We grade scanners, not skills. Numbers will change; not for citation.

Reasoning is the floor for skill safety.

On attacks we did not author, cheap and local detection caps at ~32%. The only thing that clears bar is a frontier model that reasons about the skill — claude-sonnet 81%, and it tops out there (opus ties it, McNemar p≈1.0). Pattern-matching and cheap classifiers can triage; they can't decide intent. Reasoning isn't an upgrade — it's the entry price, and even paid in full it still misses ~1 in 5.

solo + independent · no vendor, nothing to sell · code · the full story · corrections welcome from anyone

13–32%
signature scanners on independent attacks — the bet most tools ship; they miss most novel attacks, or over-block to compensate
23 → 81%
same 84 attacks, cheapest model to best — recall climbs with reasoning, not with rules
81 = 81%
claude-sonnet ties claude-opus at the ceiling — the best model money can buy still misses ~1 in 5
// THE CONCLUSION

Detection needs reasoning, not pattern-matching

Three things a scanner must do that only reasoning can — each is why a rule or a cheap classifier hits a ceiling:

1 · Tell malice from dual-use. The same capability is benign or malicious by intent. Read files → summarize your documents, or exfiltrate secrets. Make HTTP calls → hit a public API, or ship data to an attacker. A rule sees the capability; only reasoning weighs the intent. (full table in Where this leaves us)

2 · Catch hidden detonators. Logic that fires only later — on a date ("every Tuesday at 2am"), a trigger input, or a specific environment — isn't exercised during a static read, so there's nothing to match.

3 · Catch behaviorally-invisible instructions. The hardest class: pure natural language — no payload, no syscall, nothing for static or behavioral detection to ever flag. The malice is in what the agent is told to do or not do; it reads as a helpful note and is usually one stage of a larger attack:

None of these contain a flaggable string, a dangerous call, or an observable side-effect. They are malicious only because of what they mean — which is exactly the judgment a rule or a cheap classifier can't make, and a reasoning model sometimes can.

// WHY WE RAN THIS

We built a scanner, it wasn't good — and no one else's approach was different

Why benchmark. I built a free, local, private scanner and measured it honestly: it scored around a coin flip. Scanning the field, nobody's approach was fundamentally different — pattern rules, a cheap local classifier, or an LLM read. So the question stopped being "is my tool bad?" and became "is there a ceiling on this whole approach, and how high?"

The goal we were testing. Detection you can actually own — cheap, local, private, open: no GPU, no per-scan fee, no shipping your files to a company.

How we got here. ① Built our own → it under-performed. ② Found the field ships the same three approaches. ③ Ran a pre-registered experiment on an independent corpus we didn't author. I retired the scanner; this scoreboard is what I built instead (it's in the board as a non-graded reference, held to the same test).

// THE REAL TEST

An independent benchmark we didn't author

Our own corpus is LLM-generated, so an LLM scoring it has a self-recognition edge. The number that counts is on data we did not author — Skill-Inject (arXiv:2602.20156), 84 published malicious cases scored unchanged. Across the LLMs, recall is monotone in capability — until it hits a wall.

└─ recall on Skill-Inject · n=84 · dashed line = 81% ceiling (best model still misses ~1 in 5)
SkillSpector — static · local / free32% [27/84]
Cisco — static · local / free13% [10/75]
gpt-4o-mini · cloud / cheap23% [19/84]
claude-haiku-4.5 · cloud / cheap26% [22/84]
gpt-4o · cloud / mid38% [32/84]
claude-sonnet-4.6 · cloud / flagship81% [68/84]
claude-opus-4.8 · cloud / max81% [68/84 · ties sonnet]

A 4× spread by model, then a wall: "use an LLM" isn't advice without naming the model — and "buy the best" doesn't break the ceiling either (sonnet 81%, opus 81% — a dead tie, McNemar p≈1.0). Cisco errored on cases (denominator < 84); counting them as misses tells the same story.

Per-model numbers, scope & how we read it
Scanner / modelApproach & costRecall on Skill-Inject (independent, n=84)
SkillSpector — staticlocal / free32% [23–43] 27/84
Cisco — staticlocal / free13% [7–23] 10/75
gpt-4o-mini — LLM raw readcloud / cheap23% [15–33] 19/84
claude-haiku-4.5 — LLM raw readcloud / cheap26% [18–36] 22/84
gpt-4o — LLM raw readcloud / mid38% [28–49] 32/84
claude-sonnet-4.6 — LLM raw readcloud / flagship81% [71–88] 68/84
claude-opus-4.8 — LLM raw readcloud / max81% [71–88] 68/84 — ties sonnet (the ceiling)

Cisco's denominator is 75 — it errored/declined on 9 (excluded per Methodology §5; counting those as misses gives 10/84 = 12%, same story). All five LLMs had zero refusals on the generic prompt. On model + baseline selection (a deliberate scope, not an omission): the question here is intelligence vs. shortcuts — does a capable model reading the skill beat cheap pattern-matching, and is there a ceiling — not which frontier model is best. Five models spanning cheapest→flagship answer that; a model leaderboard (Codex, Gemini, …) and a human-expert baseline would answer different questions, so both are deliberately out of scope — the comparison is to the cheap/local techniques the field actually ships, not to an analyst or to a model ranking. (Access aside: OpenAI + Anthropic are the providers with direct, scriptable APIs that complete on malicious content — the §3b gateway wall blocks routing the rest through OpenRouter; a direct-keyed Gemini/Llama column is welcome via the adapter, but that's leaderboard completeness, not the thesis.)

Flagship-wins-regardless-of-vendor, not "Claude always wins": mid-tier gpt-4o (38%) beats cheap claude-haiku (26%); the two best models (Sonnet, Opus) converge at ~81% and go no higher. Same 84 attacks score 23%→81% by model — so "use an LLM" isn't advice without naming it, and "buy the best" doesn't break the ceiling.

Capability of what kind, though? We say "capability," but the 43-pt gpt-4o→Claude gap is likely not raw reasoning horsepower alone — Anthropic's safety-alignment (Constitutional AI / RLHF) over-indexes on agentic-misuse and jailbreak detection, so Claude may be acting partly as a purpose-tuned security classifier while gpt-4o is tuned as a general assistant. We attribute the result to flagship capability combined with safety-alignment weighting and don't claim to separate them — both are off-the-shelf properties a buyer gets or doesn't, which is what the board measures.

Why recall alone never ranks a scanner — see SkillGate (the block-all corner). SkillGate (FOSS, pure-static, run offline in the sandbox via its own gate check --policy) is a useful counter-example. At its pre-install profile it catches almost everything — 98% of the independent injections, 97% (116/120) of benign skills, and 98% (292/297) of dual-use — because it blocks ~everything. A scanner that blocks nearly every skill trivially "catches" nearly every attack; its balanced accuracy is still ≈ 0.45. Its only discriminating profile (audit) drops to 31% (157/507) recall while still flagging 41% (49/120) of benign (balance ≈ 0.35). No SkillGate profile is a usable discriminator. That is the whole point of plotting recall against false-positives, and why the flagship LLM's 81% — at a far lower FP — is real detection, not a high number. SkillGate sits in the top-right (block-all) corner of the master table's scatter.

Even the winner isn't free of the hard tier: claude-sonnet is strong on both overt 86% [71–94] 31/36 and indirect 77% [63–87] 37/48 injections — the only model that doesn't collapse on the subtle attacks (gpt-4o falls from overt 56% [40–70] 20/36 to indirect 25% [15–39] 12/48). An standard injection-aware prompt (published verbatim in §3a) did not rescue the weaker models — gpt-4o's pooled recall fell to 17% (14/84) when primed. We read that narrowly: standard adversarial priming degraded performance here (most plausibly safety-filter over-triggering on the longer risk-laden instruction), not "prompting can't help." The defensible claim is that out-of-the-box reasoning capability + safety-alignment, not prompt engineering, is the primary differentiator — a hand-tuned per-model prompt might recover some, but that's engineering no off-the-shelf deployment gets for free.

Composition caveat: Skill-Inject's 84 cases are not archetype-balanced — they skew to agent-hijacking (42/84, 50%; data-exfiltration 26, code-execution 16). The pooled recall therefore reflects their mix, not the wild distribution, so we report the per-tier split (overt/indirect) alongside the pooled number. It's their benchmark, scored unchanged — we don't reweight it. (Archetype and tier are two orthogonal cuts of the same 84: 42/26/16 by archetype, 36/48 by overt/indirect tier — not a discrepancy.)

Contamination check on the winner (and its limits): Claude is one of the models that helped author our in-house organic corpus, so we distrust a Claude score on our own data (self-recognition) — which is why we lead with Skill-Inject, a different research group's benchmark we authored none of. That defeats the our-authorship vector. We are careful not to claim more: Skill-Inject is built from a fixed set of human-authored injection templates (their published obvious_injections.json (36) + contextual_injections.json (48) — instruction strings with explicit goal/judge metadata, which we templated unchanged into base skills), not free-form LLM prose, so same-family LLM generation is implausible by construction — but we cannot independently verify its authors' pipeline, and a public arXiv benchmark (2602.20156) could in principle enter a model's pretraining. We therefore do not call this "uncontaminated" in the absolute. What argues the 81% is capability, not recognition: the same inputs span 26% (claude-haiku) → 38% (gpt-4o) → 81% (claude-sonnet) — a 43-pt jump from mid-tier gpt-4o to the flagship that no memorisation story explains — and claude-sonnet holds up on the harder indirect tier. All five LLMs ran via direct provider APIs (OpenAI / Anthropic); the §3b managed-gateway wall is why we avoid OpenRouter, not a missing key.

Takeaway: detection is achievable, but you pay full freight for it. The cheap/local/private bet (static rules, small models) fails on independent attacks; the one method that works — a flagship frontier model reading the skill — is the cloud, paid, file-leaking, token-burning option the whole exercise tried to avoid. Capability is the price. Corroborated in shape by the BIV study (arXiv:2605.11770): rule-based 0.44 vs LLM 0.927 — though that is an F1 on a less-evasive benchmark, not recall on evasive injections, so the gap (static ≪ a strong LLM) transfers, the absolute level does not.

// FULL BOARD

Every scanner, one table

One row per scanner × mode, same axes. Recall sits next to its false-positive cost; bars show the point estimate. Balance = balanced accuracy (0.5 = coin-flip), not a ranking. LLM-control and author-reference rows are dimmed (not graded products). The in-house recall here is self-recognition-flattered — for the honest cross-model number see the ladder above.

filter:click a column ⇅ to sort · click a chip to filter
ScannerRecallFalse-posWildBal
Cisco AI Defense
static rules · offline
6%
5%
known / novel
8%
dual-use · benign 0%
1/40.50
SkillGate (preinstall gate)
static rules · offline
75%
94%
known / novel
98%
dual-use · benign 97%
5/50.44
SkillSpector
static rules · offline
6%
7%
known / novel
25%
dual-use · benign 0%
2/50.44
Cisco AI Defense +llm
+llm · gpt-4o direct
83%
85%
known / novel
35%
dual-use · benign 10%
4/40.78
SkillSpector +llm
+llm · gpt-4o direct
4%
7%
known / novel
20%
dual-use · benign 0%
2/50.46
Snyk Agent Scan
cloud · cloud LLM
72%
81%
known / novel
34%
dual-use · benign 15%
3/50.74
LLM baseline · Qwen-72B (in-set)
LLM read · open-weight
58%
32%
known / novel
1%
dual-use · benign 0%
0/50.72
LLM baseline · phi-4 (disjoint)
LLM read · open-weight
65%
53%
known / novel
3%
dual-use · benign 0%
0/40.78
claude-haiku-4.5
raw read · direct API
80%
93%
known / novel
2%
dual-use · benign 0%
0.93
claude-opus-4.8
raw read · direct API
43%
90%
known / novel
16%
dual-use · benign 4%
0.79
claude-sonnet-4.6
raw read · direct API
64%
97%
known / novel
2%
dual-use · benign 0%
0.90
gpt-4o
raw read · direct API
85%
85%
known / novel
5%
dual-use · benign 2%
0.91
gpt-4o-mini
raw read · direct API
84%
64%
known / novel
5%
dual-use · benign 2%
0.85
skillscan (mine, retired)
static rules · offline
69%
69%
known / novel
87%
dual-use · benign 78%
1/10.42
skillscan (mine, retired)
local ML · offline
59%
69%
known / novel
42%
dual-use · benign 78%
1/10.57
Cisco AI Defense
static rules · offline
0.50
BALANCE
recall6% kn
5% nv
FP dual8%
wild 1/4 · FP benign 0%
SkillGate (preinstall gate)
static rules · offline
0.44
BALANCE
recall75% kn
94% nv
FP dual98%
wild 5/5 · FP benign 97%
SkillSpector
static rules · offline
0.44
BALANCE
recall6% kn
7% nv
FP dual25%
wild 2/5 · FP benign 0%
Cisco AI Defense +llm
+llm · gpt-4o direct
0.78
BALANCE
recall83% kn
85% nv
FP dual35%
wild 4/4 · FP benign 10%
SkillSpector +llm
+llm · gpt-4o direct
0.46
BALANCE
recall4% kn
7% nv
FP dual20%
wild 2/5 · FP benign 0%
Snyk Agent Scan
cloud · cloud LLM
0.74
BALANCE
recall72% kn
81% nv
FP dual34%
wild 3/5 · FP benign 15%
LLM baseline · Qwen-72B (in-set)
LLM read · open-weight
0.72
BALANCE
recall58% kn
32% nv
FP dual1%
wild 0/5 · FP benign 0%
LLM baseline · phi-4 (disjoint)
LLM read · open-weight
0.78
BALANCE
recall65% kn
53% nv
FP dual3%
wild 0/4 · FP benign 0%
claude-haiku-4.5
raw read · direct API
0.93
BALANCE
recall80% kn
93% nv
FP dual2%
wild — · FP benign 0%
claude-opus-4.8
raw read · direct API
0.79
BALANCE
recall43% kn
90% nv
FP dual16%
wild — · FP benign 4%
claude-sonnet-4.6
raw read · direct API
0.90
BALANCE
recall64% kn
97% nv
FP dual2%
wild — · FP benign 0%
gpt-4o
raw read · direct API
0.91
BALANCE
recall85% kn
85% nv
FP dual5%
wild — · FP benign 2%
gpt-4o-mini
raw read · direct API
0.85
BALANCE
recall84% kn
64% nv
FP dual5%
wild — · FP benign 2%
skillscan (mine, retired)
static rules · offline
0.42
BALANCE
recall69% kn
69% nv
FP dual87%
wild 1/1 · FP benign 78%
skillscan (mine, retired)
local ML · offline
0.57
BALANCE
recall59% kn
69% nv
FP dual42%
wild 1/1 · FP benign 78%

Balance = balanced accuracy = (recall + specificity)/2, prevalence-independent (deliberately not F1). 0.5 = coin-flip, 1.0 = perfect. Every recall/FP carries a Wilson 95% interval (shown in the per-row detail / live site).

// THE PICTURE

Recall vs false-positives

Top-left is ideal — catches much, flags little. Two clusters: static scanners sit bottom-left (catch little); LLM-reading sits top. SkillGate sits top-right: it "catches" everything by blocking everything. (axes = in-house corpus, where FP is measurable)

↑ recall (catches more attacks)false-positive on legit dual-use →ideal0%0%25%25%50%50%75%75%100%100%claude-haiku-4.5SkillGate (preinstall gate)gpt-4oCisco AI Defense +llmclaude-sonnet-4.6Snyk Agent Scangpt-4o-miniskillscan (rules)claude-opus-4.8skillscan (ML)phi-4 (disjoint)Qwen-72B (in-set)SkillSpectorSkillSpector +llmCisco AI Defenseaxes = in-house corpus; the contamination-free winner (claude-sonnet 81% at low FP) is on the independent Skill-Inject table
staticLLM-productLLM-controlauthor-ref
// INTEGRATION > PRESENCE

The same model, wired two ways: ~83% vs ~4–7%

Both SkillSpector and Cisco ship a +llm mode; we ran both against the same model (gpt-4o, direct). The result splits entirely on how the model is wired in:

> Same model, an order of magnitude apart. The integration, not "having an LLM," decides it.

Not a harness artifact: the model demonstrably fired (non-zero refusals + shifted FP vs static); backend availability is itself a finding — see Methodology §3b.

// WHERE THIS LEAVES US

Read-time scanning is a triage layer, not the security boundary

One question: does read-time skill scanning have a practical ceiling, and how high is it? It does — ~81%. The two best frontier models tie there; static rules, cheap local classifiers, and earlier models don't clear bar at all. Not give up, and not that scanning is useless — but the ceiling is low, for a structural reason.

A read-time scanner sees text, code, and metadata. It does not see future user intent, runtime context, network behavior, credential use, or tool effects — the things that decide whether a capable skill gets used benignly or maliciously. A bigger model (Opus) doesn't move the ceiling, because the limit is the information available at read time, not the reviewer's intelligence:

The same capability is benign or malicious depending on intent, context, and authority a scanner can't see at review time:

CapabilityBenign useMalicious useWhy read-time can't tell
Read local filessummarize your documentsexfiltrate secretssame primitive — intent differs
Make HTTP requestscall a public APIsend data to an attackerdestination may be configurable or delayed
Transform textclean datasmuggle a prompt injectionbehavior depends on the input it's given
Wrap a shell commanddeveloper automationcredential theftdanger depends on the command + user context
Drive a browserfill formsabuse a logged-in sessionruntime authority is what matters

And the one method that does clear bar — a top frontier model reading the skill — still isn't an enforcement boundary you can own. To use it you must:

Treat skill review as a linting and triage layer — not the enforcement boundary.

The boundary needs to move closer to execution: capability isolation, runtime permissioning, provenance / signed identity, behavioral monitoring, and constrained tool APIs. A direction the data motivates — not something this board measures.

How we scope this claim — and what we don't claim

Three claims, kept separate so the strong one isn't overstated: empirical — the read-time scanners we tested perform poorly or incompletely on this corpus; practical — the best observed operating point isn't good enough to serve as a security boundary; structural — some failures are inherent, because a skill's maliciousness can depend on execution context unavailable at scan time (the table above). The first two we measure here; the third is the table's argument, not a proof.

Scoped honestly, this is about the read-time, static / one-shot review paradigm — not a claim that no future system could do better. The lesson isn't build a better regex or use a bigger review model; those improve triage but don't change the information boundary.

// THE FINE PRINT
Does it catch NEW attacks? — the generalization gap

A large positive gap = catches known campaigns but not novel disguised behaviour. Novel recall excludes a scanner's own generator family. A negative gap (novel ≥ known) is the signature of genuine behavioural detection, not memorisation. Significance is BH-FDR corrected across 15 comparisons (q=0.05). "Known ≫ novel?" is literal for static scanners (no signature → cannot fire) but an upper bound for LLM scanners (see Methodology §4).

scanner · archetypeknown → novelgapverdict
Cisco AI Defense skill-scanner · agent_hijacking
16%0%
+16%known≫novel
Cisco AI Defense skill-scanner · code_execution
5%15%
-10%no gap
Cisco AI Defense skill-scanner · data_exfiltration
5%0%
+5%no gap
LLM baseline (Qwen-72B, in-set) · agent_hijacking
58%34%
+24%no gap
LLM baseline (Qwen-72B, in-set) · code_execution
58%31%
+27%known≫novel
LLM baseline (Qwen-72B, in-set) · data_exfiltration
57%31%
+26%known≫novel
LLM baseline (phi-4, disjoint) · agent_hijacking
62%47%
+15%no gap
LLM baseline (phi-4, disjoint) · code_execution
64%56%
+8%no gap
LLM baseline (phi-4, disjoint) · data_exfiltration
68%56%
+12%no gap
SkillSpector · agent_hijacking
8%0%
+8%known≫novel
SkillSpector · code_execution
6%8%
-2%no gap
SkillSpector · data_exfiltration
3%12%
-9%no gap
Snyk Agent Scan · agent_hijacking
73%72%
+1%no gap
Snyk Agent Scan · code_execution
72%88%
-16%no gap
Snyk Agent Scan · data_exfiltration
70%84%
-14%no gap

Caveat (static): because novel samples are defanged, a rule keying on a real payload misses them partly because the matchable artifact was neutralised, not purely because the behaviour is novel — so static gap is an over-estimate. Caveat (LLM): "known" (organic) is also more blatant than the disguised synthetic, so the gap conflates training-recognition with disguise-difficulty. Note a negative gap (novel ≥ known, e.g. Snyk on some archetypes) is not a paradox — it's the signature of genuine behavioural detection (catching novel as well as or better than known), the opposite of memorisation.

On REAL wild malware? — the cleanest, smallest signal

The only fully-unimpeachable data — and the smallest: real, in-the-wild malicious skills are scarce and ephemeral. Our set is n=5, all one archetype (code-execution), so intervals are wide and we treat wild as supporting evidence, not proof. Independent corroboration: a USENIX 2026 study found only 157 malicious in 98,380 skills (~0.16%). See the Wild column in the master table (shown as k/n).

My own retired scanner — held to the same test

To prove I'm not exempting myself: skillscan, the scanner I built and retired, run at its strict default over the same corpus (a non-graded reference). Both modes are fully offline — static rules and a local fine-tuned model (Qwen2.5-1.5B), exactly the cheap/local/no-GPU bet. At strict it over-blocks (catches a lot but flags ~78% of benign and ~87% of dual-use), so its balance ≈ a coin flip (0.42 static / 0.57 local-ML). The mirror image of the others (which miss at low FP); same bottom line — no real separation. Read the local-ML row with extra suspicion against me: that detector was trained on generated skills of the same lineage as this corpus and largely memorised them, so its recall is a contamination-inflated upper bound. See its rows in the master table (author-reference band).

Why you can trust these numbers (and where they're soft)

Populations: 1002 total = 918 base corpus (the scanners' main scores) + 84 independent Skill-Inject (scored separately, above). The 918 base splits 423 malicious + 495 benign/dual-use; malicious by archetype incl. independent: code-execution 200, data-exfiltration 164, agent-hijacking 143 (sums to 507 = 423 base + 84 independent). Real-wild malicious = 5 (all code-execution). Generation provenance (disclosed): the in-house malicious set is LLM-written — organic via our tooling (gpt-4o / Claude Sonnet / DeepSeek; per-sample split not recorded), defanged synthetic via open-weight models (mixtral-8x22b / gemma-2-27b / hermes-3 / llama-3.3-70b / qwen2.5-72b / deepseek-v3.1, ~even split). Because the in-house set is LLM-written, we lead with the independent Skill-Inject numbers, not the higher (self-recognition-flattered) in-house recall.

Not self-recognition: the best detector (phi-4) generated none of the corpus and is scored cross-family; the in-set Qwen scores worst on its own samples; on benign skills run through the same generators + defang, the baselines flag 0% — they key on malice, not machine-generated style. Dual-use labels come from a single automated open-weight judge (llama-3.3-70b), not human inter-rater — but since the FP axis rests on them, a second rater from a different model family (claude-sonnet) agreed 100% (50/50) on a random sample that the dual-use skills are non-malicious (so a flag is a real FP, not a mislabel; human grading of the subset is the gold-standard follow-up — §2.4). Every recall/FP carries a Wilson 95% interval; we never rank on recall alone. SkillSpector & Cisco static layers were graded; their +llm modes ran only on gpt-4o-direct (other backends blocked/limited, §3b).

What we grade vs cite — the access bar

Inclusion rule: we grade what a normal developer can run without contacting sales or a big spend — FOSS or a free self-serve tier. Tools that are sales-gated or have no self-serve API we cite but cannot benchmark (you can't script a web form). License isn't the bar — accessibility is; a free commercial self-serve tier qualifies. Tested data is never discarded; new free/self-serve scanners welcome (the harness takes an adapter).

SkillSpector · CiscogradedFOSS, self-serve
Snyk Agent Scangradedfree self-serve tier (cloud LLM)
SkillGate (charliechenye, MIT)gradedFOSS pure-static; run offline in the sandbox via its own gate (check --policy). The block-all corner: at preinstall it flags ~97% of benign (balance 0.45); audit is more lenient but still over-blocks at 31% recall. No usable profile
ESET AI Skills Checkerciteweb-form only, no API — not scriptable
Mitiga Skillgateciteaccount-gated, no public API
SkillSieve (arXiv:2604.06550)cite — pendingOSS announced, repo not yet live
BIV (arXiv:2605.11770)citecorroborates static≪LLM in shape (F1 0.44 vs 0.927 — F1 on a less-evasive set, not recall on evasive injections)

Cross-scanner consensus (SkillSpector + Cisco): full agreement on only 3.4% of 467 malicious samples — the subset (of 507 total malicious incl. the 84 independent) on which both returned a non-error verdict (40 are excluded because one scanner errored/declined, per §5). We compute consensus over the two discriminating static scanners only; SkillGate is excluded because it blocks ~everything, so it would trivially "agree" and inflate the number. Since SkillSpector and Cisco both barely fire, low agreement is partly mechanical: read it as "they rarely co-fire," not deep semantic disagreement.

Quick glossary

recall = how many real attacks a scanner catches · known vs novel = attacks already documented vs disguised/never-seen (novel is the real test) · false-positive = safe tools it wrongly flags · dual-use = legitimate-but-scary skills (e.g. a real auth tool that reads a token) · static rules = pattern-matching, the cheap/private bet · +llm / cloud = sending the file to a model · refusal = the scanner errored or declined instead of giving a verdict · balance = balanced accuracy, 0.5 = coin-flip.