On attacks we did not author, cheap and local detection caps at ~32%. The only thing that clears bar is a frontier model that reasons about the skill — claude-sonnet 82%, and it tops out there (opus ties it, McNemar p≈1.0). Pattern-matching and cheap classifiers can triage; they can't decide intent. Reasoning isn't an upgrade — it's the entry price, and even paid in full it still misses ~1 in 5.
solo + independent · no vendor, nothing to sell · code · the full story · corrections welcome from anyone
Three things a scanner must do that only reasoning can — each is why a rule or a cheap classifier hits a ceiling:
1 · Tell malice from dual-use. The same capability is benign or malicious by intent. Read files → summarize your documents, or exfiltrate secrets. Make HTTP calls → hit a public API, or ship data to an attacker. A rule sees the capability; only reasoning weighs the intent. (full table in Where this leaves us)
2 · Catch hidden detonators. Logic that fires only later — on a date ("every Tuesday at 2am"), a trigger input, or a specific environment — isn't exercised during a static read, so there's nothing to match.
3 · Catch behaviorally-invisible instructions. The hardest class: pure natural language — no payload, no syscall, nothing for static or behavioral detection to ever flag. The malice is in what the agent is told to do or not do; it reads as a helpful note and is usually one stage of a larger attack:
None of these contain a flaggable string, a dangerous call, or an observable side-effect. They are malicious only because of what they mean — which is exactly the judgment a rule or a cheap classifier can't make, and a reasoning model sometimes can.
Why benchmark. I built a free, local, private scanner and measured it honestly: it scored around a coin flip. Scanning the field, nobody's approach was fundamentally different — pattern rules, a cheap local classifier, or an LLM read. So the question stopped being "is my tool bad?" and became "is there a ceiling on this whole approach, and how high?"
The goal we were testing. Detection you can actually own — cheap, local, private, open: no GPU, no per-scan fee, no shipping your files to a company.
How we got here. ① Built our own → it under-performed. ② Found the field ships the same three approaches. ③ Ran a pre-registered experiment on an independent corpus we didn't author. I retired the scanner; this scoreboard is what I built instead (it's in the board as a non-graded reference, held to the same test).
Our own corpus is LLM-generated, so an LLM scoring it has a self-recognition edge. The number that counts is on data we did not author — Skill-Inject (arXiv:2602.20156), 84 published malicious cases scored unchanged. Across the LLMs, recall is monotone in capability — until it hits a wall.
A 4× spread by model, then a wall: "use an LLM" isn't advice without naming the model — and "buy the best" doesn't break the ceiling either (sonnet 82% vs opus 81%, McNemar p≈1.0). Cisco errored on cases (denominator < 84); counting them as misses tells the same story.
| Scanner / model | Approach & cost | Recall on Skill-Inject (independent, n=84) |
|---|---|---|
| SkillSpector — static | local / free | 32% [23–43] 27/84 |
| Cisco — static | local / free | 13% [7–23] 10/75 |
| gpt-4o-mini — LLM raw read | cloud / cheap | 23% [15–33] 19/84 |
| claude-haiku-4.5 — LLM raw read | cloud / cheap | 29% [20–39] 24/84 |
| gpt-4o — LLM raw read | cloud / mid | 38% [28–49] 32/84 |
| claude-sonnet-4.6 — LLM raw read | cloud / flagship | 82% [73–89] 69/84 |
| claude-opus-4.8 — LLM raw read | cloud / max | 81% [71–88] 68/84 — ties sonnet (the ceiling) |
Cisco's denominator is 75 — it errored/declined on 9 (excluded per Methodology §5; counting those as misses gives 10/84 = 12%, same story). All five LLMs had zero refusals on the generic prompt. On model + baseline selection (a deliberate scope, not an omission): the question here is intelligence vs. shortcuts — does a capable model reading the skill beat cheap pattern-matching, and is there a ceiling — not which frontier model is best. Five models spanning cheapest→flagship answer that; a model leaderboard (Codex, Gemini, …) and a human-expert baseline would answer different questions, so both are deliberately out of scope — the comparison is to the cheap/local techniques the field actually ships, not to an analyst or to a model ranking. (Access aside: OpenAI + Anthropic are the providers with direct, scriptable APIs that complete on malicious content — the §3b gateway wall blocks routing the rest through OpenRouter; a direct-keyed Gemini/Llama column is welcome via the adapter, but that's leaderboard completeness, not the thesis.)
Flagship-wins-regardless-of-vendor, not "Claude always wins": mid-tier gpt-4o (38%) beats cheap claude-haiku (29%); the two best models (Sonnet, Opus) converge at ~82% and go no higher. Same 84 attacks score 23%→82% by model — so "use an LLM" isn't advice without naming it, and "buy the best" doesn't break the ceiling.
Capability of what kind, though? We say "capability," but the 44-pt gpt-4o→Claude gap is likely not raw reasoning horsepower alone — Anthropic's safety-alignment (Constitutional AI / RLHF) over-indexes on agentic-misuse and jailbreak detection, so Claude may be acting partly as a purpose-tuned security classifier while gpt-4o is tuned as a general assistant. We attribute the result to flagship capability combined with safety-alignment weighting and don't claim to separate them — both are off-the-shelf properties a buyer gets or doesn't, which is what the board measures.
Why recall alone never ranks a scanner — see SkillGate (the block-all corner). SkillGate (FOSS, pure-static, run offline in the sandbox via its own gate check --policy) is a useful counter-example. At its pre-install profile it catches almost everything — 98% of the independent injections, 97% (116/120) of benign skills, and 98% (292/297) of dual-use — because it blocks ~everything. A scanner that blocks nearly every skill trivially "catches" nearly every attack; its balanced accuracy is still ≈ 0.45. Its only discriminating profile (audit) drops to 31% (157/507) recall while still flagging 41% (49/120) of benign (balance ≈ 0.35). No SkillGate profile is a usable discriminator. That is the whole point of plotting recall against false-positives, and why the flagship LLM's 82% — at a far lower FP — is real detection, not a high number. SkillGate sits in the top-right (block-all) corner of the master table's scatter.
Even the winner isn't free of the hard tier: claude-sonnet is strong on both overt 86% [71–94] 31/36 and indirect 79% [66–88] 38/48 injections — the only model that doesn't collapse on the subtle attacks (gpt-4o falls from overt 56% [40–70] 20/36 to indirect 25% [15–39] 12/48). An standard injection-aware prompt (published verbatim in §3a) did not rescue the weaker models — gpt-4o's pooled recall fell to 17% (14/84) when primed. We read that narrowly: standard adversarial priming degraded performance here (most plausibly safety-filter over-triggering on the longer risk-laden instruction), not "prompting can't help." The defensible claim is that out-of-the-box reasoning capability + safety-alignment, not prompt engineering, is the primary differentiator — a hand-tuned per-model prompt might recover some, but that's engineering no off-the-shelf deployment gets for free.
Composition caveat: Skill-Inject's 84 cases are not archetype-balanced — they skew to agent-hijacking (42/84, 50%; data-exfiltration 26, code-execution 16). The pooled recall therefore reflects their mix, not the wild distribution, so we report the per-tier split (overt/indirect) alongside the pooled number. It's their benchmark, scored unchanged — we don't reweight it. (Archetype and tier are two orthogonal cuts of the same 84: 42/26/16 by archetype, 36/48 by overt/indirect tier — not a discrepancy.)
Contamination check on the winner (and its limits): Claude is one of the models that helped author our in-house organic corpus, so we distrust a Claude score on our own data (self-recognition) — which is why we lead with Skill-Inject, a different research group's benchmark we authored none of. That defeats the our-authorship vector. We are careful not to claim more: Skill-Inject is built from a fixed set of human-authored injection templates (their published obvious_injections.json (36) + contextual_injections.json (48) — instruction strings with explicit goal/judge metadata, which we templated unchanged into base skills), not free-form LLM prose, so same-family LLM generation is implausible by construction — but we cannot independently verify its authors' pipeline, and a public arXiv benchmark (2602.20156) could in principle enter a model's pretraining. We therefore do not call this "uncontaminated" in the absolute. What argues the 82% is capability, not recognition: the same inputs span 29% (claude-haiku) → 38% (gpt-4o) → 82% (claude-sonnet) — a 44-pt gap between the cheap models and the flagship that no memorisation story explains — and claude-sonnet holds up on the harder indirect tier. All four LLMs ran via direct provider APIs (OpenAI / Anthropic); the §3b managed-gateway wall is why we avoid OpenRouter, not a missing key.
Takeaway: detection is achievable, but you pay full freight for it. The cheap/local/private bet (static rules, small models) fails on independent attacks; the one method that works — a flagship frontier model reading the skill — is the cloud, paid, file-leaking, token-burning option the whole exercise tried to avoid. Capability is the price. Corroborated in shape by the BIV study (arXiv:2605.11770): rule-based 0.44 vs LLM 0.927 — though that is an F1 on a less-evasive benchmark, not recall on evasive injections, so the gap (static ≪ a strong LLM) transfers, the absolute level does not.
One row per scanner × mode, same axes. Recall sits next to its false-positive cost; bars show the point estimate. Balance = balanced accuracy (0.5 = coin-flip), not a ranking. LLM-control and author-reference rows are dimmed (not graded products). The in-house recall here is self-recognition-flattered — for the honest cross-model number see the ladder above.
| Scanner | Recall | False-pos | Wild | Bal |
|---|---|---|---|---|
Cisco AI Defense static rules · offline | 6% 5% known / novel | 8% dual-use · benign 0% | 1/4 | 0.50 |
SkillGate (preinstall gate) static rules · offline | 75% 94% known / novel | 98% dual-use · benign 97% | 5/5 | 0.44 |
SkillSpector static rules · offline | 6% 7% known / novel | 25% dual-use · benign 0% | 2/5 | 0.44 |
Cisco AI Defense +llm +llm · gpt-4o direct | 83% 85% known / novel | 35% dual-use · benign 10% | 4/4 | 0.78 |
SkillSpector +llm +llm · gpt-4o direct | 4% 7% known / novel | 20% dual-use · benign 0% | 2/5 | 0.46 |
Snyk Agent Scan cloud · cloud LLM | 72% 81% known / novel | 34% dual-use · benign 15% | 3/5 | 0.74 |
LLM baseline · Qwen-72B (in-set) LLM read · open-weight | 58% 32% known / novel | 1% dual-use · benign 0% | 0/5 | 0.72 |
LLM baseline · phi-4 (disjoint) LLM read · open-weight | 65% 53% known / novel | 3% dual-use · benign 0% | 0/4 | 0.78 |
claude-haiku-4.5 raw read · direct API | 80% 93% known / novel | 2% dual-use · benign 0% | — | 0.93 |
claude-opus-4.8 raw read · direct API | 43% 90% known / novel | 16% dual-use · benign 4% | — | 0.79 |
claude-sonnet-4.6 raw read · direct API | 64% 97% known / novel | 2% dual-use · benign 0% | — | 0.90 |
gpt-4o raw read · direct API | 85% 85% known / novel | 5% dual-use · benign 2% | — | 0.91 |
gpt-4o-mini raw read · direct API | 84% 64% known / novel | 5% dual-use · benign 2% | — | 0.85 |
skillscan (mine, retired) static rules · offline | 69% 69% known / novel | 87% dual-use · benign 78% | 1/1 | 0.42 |
skillscan (mine, retired) local ML · offline | 59% 69% known / novel | 42% dual-use · benign 78% | 1/1 | 0.57 |
Balance = balanced accuracy = (recall + specificity)/2, prevalence-independent (deliberately not F1). 0.5 = coin-flip, 1.0 = perfect. Every recall/FP carries a Wilson 95% interval (shown in the per-row detail / live site).
Top-left is ideal — catches much, flags little. Two clusters: static scanners sit bottom-left (catch little); LLM-reading sits top. SkillGate sits top-right: it "catches" everything by blocking everything. (axes = in-house corpus, where FP is measurable)
Both SkillSpector and Cisco ship a +llm mode; we ran both against the same model (gpt-4o, direct). The result splits entirely on how the model is wired in:
cff7ecc) — not as the verdict engine.> Same model, an order of magnitude apart. The integration, not "having an LLM," decides it.
Not a harness artifact: the model demonstrably fired (non-zero refusals + shifted FP vs static); backend availability is itself a finding — see Methodology §3b.
One question: does read-time skill scanning have a practical ceiling, and how high is it? It does — ~82%. The two best frontier models tie there; static rules, cheap local classifiers, and earlier models don't clear bar at all. Not give up, and not that scanning is useless — but the ceiling is low, for a structural reason.
A read-time scanner sees text, code, and metadata. It does not see future user intent, runtime context, network behavior, credential use, or tool effects — the things that decide whether a capable skill gets used benignly or maliciously. A bigger model (Opus) doesn't move the ceiling, because the limit is the information available at read time, not the reviewer's intelligence:
The same capability is benign or malicious depending on intent, context, and authority a scanner can't see at review time:
And the one method that does clear bar — a top frontier model reading the skill — still isn't an enforcement boundary you can own. To use it you must:
Treat skill review as a linting and triage layer — not the enforcement boundary.
The boundary needs to move closer to execution: capability isolation, runtime permissioning, provenance / signed identity, behavioral monitoring, and constrained tool APIs. A direction the data motivates — not something this board measures.
Three claims, kept separate so the strong one isn't overstated: empirical — the read-time scanners we tested perform poorly or incompletely on this corpus; practical — the best observed operating point isn't good enough to serve as a security boundary; structural — some failures are inherent, because a skill's maliciousness can depend on execution context unavailable at scan time (the table above). The first two we measure here; the third is the table's argument, not a proof.
Scoped honestly, this is about the read-time, static / one-shot review paradigm — not a claim that no future system could do better. The lesson isn't build a better regex or use a bigger review model; those improve triage but don't change the information boundary.
A large positive gap = catches known campaigns but not novel disguised behaviour. Novel recall excludes a scanner's own generator family. A negative gap (novel ≥ known) is the signature of genuine behavioural detection, not memorisation. Significance is BH-FDR corrected across 15 comparisons (q=0.05). "Known ≫ novel?" is literal for static scanners (no signature → cannot fire) but an upper bound for LLM scanners (see Methodology §4).
Caveat (static): because novel samples are defanged, a rule keying on a real payload misses them partly because the matchable artifact was neutralised, not purely because the behaviour is novel — so static gap is an over-estimate. Caveat (LLM): "known" (organic) is also more blatant than the disguised synthetic, so the gap conflates training-recognition with disguise-difficulty. Note a negative gap (novel ≥ known, e.g. Snyk on some archetypes) is not a paradox — it's the signature of genuine behavioural detection (catching novel as well as or better than known), the opposite of memorisation.
The only fully-unimpeachable data — and the smallest: real, in-the-wild malicious skills are scarce and ephemeral. Our set is n=5, all one archetype (code-execution), so intervals are wide and we treat wild as supporting evidence, not proof. Independent corroboration: a USENIX 2026 study found only 157 malicious in 98,380 skills (~0.16%). See the Wild column in the master table (shown as k/n).
To prove I'm not exempting myself: skillscan, the scanner I built and retired, run at its strict default over the same corpus (a non-graded reference). Both modes are fully offline — static rules and a local fine-tuned model (Qwen2.5-1.5B), exactly the cheap/local/no-GPU bet. At strict it over-blocks (catches a lot but flags ~78% of benign and ~87% of dual-use), so its balance ≈ a coin flip (0.42 static / 0.57 local-ML). The mirror image of the others (which miss at low FP); same bottom line — no real separation. Read the local-ML row with extra suspicion against me: that detector was trained on generated skills of the same lineage as this corpus and largely memorised them, so its recall is a contamination-inflated upper bound. See its rows in the master table (author-reference band).
Populations: 1002 total = 918 base corpus (the scanners' main scores) + 84 independent Skill-Inject (scored separately, above). The 918 base splits 423 malicious + 495 benign/dual-use; malicious by archetype incl. independent: code-execution 200, data-exfiltration 164, agent-hijacking 143 (sums to 507 = 423 base + 84 independent). Real-wild malicious = 5 (all code-execution). Generation provenance (disclosed): the in-house malicious set is LLM-written — organic via our tooling (gpt-4o / Claude Sonnet / DeepSeek; per-sample split not recorded), defanged synthetic via open-weight models (mixtral-8x22b / gemma-2-27b / hermes-3 / llama-3.3-70b / qwen2.5-72b / deepseek-v3.1, ~even split). Because the in-house set is LLM-written, we lead with the independent Skill-Inject numbers, not the higher (self-recognition-flattered) in-house recall.
Not self-recognition: the best detector (phi-4) generated none of the corpus and is scored cross-family; the in-set Qwen scores worst on its own samples; on benign skills run through the same generators + defang, the baselines flag 0% — they key on malice, not machine-generated style. Dual-use labels come from a single automated open-weight judge (llama-3.3-70b), not human inter-rater — but since the FP axis rests on them, a second rater from a different model family (claude-sonnet) agreed 100% (50/50) on a random sample that the dual-use skills are non-malicious (so a flag is a real FP, not a mislabel; human grading of the subset is the gold-standard follow-up — §2.4). Every recall/FP carries a Wilson 95% interval; we never rank on recall alone. SkillSpector & Cisco static layers were graded; their +llm modes ran only on gpt-4o-direct (other backends blocked/limited, §3b).
Inclusion rule: we grade what a normal developer can run without contacting sales or a big spend — FOSS or a free self-serve tier. Tools that are sales-gated or have no self-serve API we cite but cannot benchmark (you can't script a web form). License isn't the bar — accessibility is; a free commercial self-serve tier qualifies. Tested data is never discarded; new free/self-serve scanners welcome (the harness takes an adapter).
Cross-scanner consensus (SkillSpector + Cisco): full agreement on only 3.4% of 467 malicious samples — the subset (of 507 total malicious incl. the 84 independent) on which both returned a non-error verdict (40 are excluded because one scanner errored/declined, per §5). We compute consensus over the two discriminating static scanners only; SkillGate is excluded because it blocks ~everything, so it would trivially "agree" and inflate the number. Since SkillSpector and Cisco both barely fire, low agreement is partly mechanical: read it as "they rarely co-fire," not deep semantic disagreement.
recall = how many real attacks a scanner catches · known vs novel = attacks already documented vs disguised/never-seen (novel is the real test) · false-positive = safe tools it wrongly flags · dual-use = legitimate-but-scary skills (e.g. a real auth tool that reads a token) · static rules = pattern-matching, the cheap/private bet · +llm / cloud = sending the file to a model · refusal = the scanner errored or declined instead of giving a verdict · balance = balanced accuracy, 0.5 = coin-flip.