skillscan.sh
independent scoreboard for AI-skill security scanners

skillscan.sh Scoreboard — Methodology (pre-registered v1)

Pre-registered before running, so the schema/metrics/tests can't be retrofitted to a conclusion. We grade skill-security scanners; we are not a scanner and have nothing to sell.

1 What we measure

Each scanner, in each mode (static / +llm), is scored on a curated corpus stratified by archetype × provenance × difficulty tier, plus benign controls. We report recall, false-positive rate, and refusal/error rate as three separate columns (a miss ≠ a refusal), each with a confidence interval, and never collapse to a single rank.
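As a rough illustration of why the three columns stay separate, the sketch below tallies per-sample outcomes into recall, false-positive rate, and refusal/error rate. The record fields (`label`, `verdict`) and the choice to leave refusals out of the recall and FP denominators are assumptions made for this example, not the harness's actual schema.

```python
from collections import Counter

def score(records):
    """Tally recall, FP rate and refusal rate as three separate numbers.

    Each record is a dict with hypothetical fields:
      label   -- "malicious" or "benign" (ground truth)
      verdict -- "flag", "pass", or "refused" (scanner output or error)
    """
    c = Counter((r["label"], r["verdict"]) for r in records)
    tp, fn = c[("malicious", "flag")], c[("malicious", "pass")]
    fp, tn = c[("benign", "flag")], c[("benign", "pass")]
    refused = c[("malicious", "refused")] + c[("benign", "refused")]
    total = len(records)
    return {
        # Assumption: refusals are excluded from both denominators,
        # so a refusal is never silently counted as a miss.
        "recall": tp / (tp + fn) if tp + fn else None,
        "fp_rate": fp / (fp + tn) if fp + tn else None,
        "refusal_rate": refused / total if total else None,
    }

sample = [
    {"label": "malicious", "verdict": "flag"},
    {"label": "malicious", "verdict": "pass"},
    {"label": "malicious", "verdict": "refused"},
    {"label": "benign", "verdict": "pass"},
    {"label": "benign", "verdict": "flag"},
]
result = score(sample)
```

Counting the refused sample as a miss instead would lower recall in this toy set from 1/2 to 1/3, which is why the two treatments are worth showing side by side.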

Scope (stated to avoid a false dichotomy). This benchmark evaluates static-time and read-time detection only — pattern/rule scanners and one-shot LLM reads of the skill before it runs. Dynamic runtime analysis (sandboxed execution, syscall/eBPF monitoring, behavioral tracing) is a different category, complementary to ours, and is out of scope — its absence here is deliberate, not an oversight. "Static rules vs. LLM reading" is the read-time comparison; it is not a claim that read-time is the only viable approach.

2 Schema (the axes)

2.1 Archetype — 3 behavioral classes (what the attack does)

| Archetype | Behavior | Absorbs (attack_vector) |
| --- | --- | --- |
| code_execution | get attacker code running | supply_chain install-exec, command_injection, rce, code_injection |
| data_exfiltration | steal secrets/data/telemetry | information_disclosure, data_exfiltration, passive_surveillance, credential_theft |
| agent_hijacking | subvert the agent's goals/instructions | prompt_injection, instruction override, tool/MCP poisoning |

Behavioral (not vector-based) because behavioral detection is what we're testing. The finer attack_vector is kept as a secondary tag (so supply-chain etc. stay queryable).
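One way to picture the two-level tagging is a lookup from the finer attack_vector to its archetype. This is a minimal sketch: the mapping follows the archetype table, but the tag spellings are normalized for the example (for instance `instruction_override`, `tool_mcp_poisoning`) and the function name `tag_sample` is hypothetical.

```python
# Secondary attack_vector tag -> primary behavioral archetype.
# Tag spellings are normalized for this example (hypothetical), e.g.
# "instruction override" in the table becomes "instruction_override".
ARCHETYPE_OF = {
    "supply_chain_install_exec": "code_execution",
    "command_injection": "code_execution",
    "rce": "code_execution",
    "code_injection": "code_execution",
    "information_disclosure": "data_exfiltration",
    "data_exfiltration": "data_exfiltration",
    "passive_surveillance": "data_exfiltration",
    "credential_theft": "data_exfiltration",
    "prompt_injection": "agent_hijacking",
    "instruction_override": "agent_hijacking",
    "tool_mcp_poisoning": "agent_hijacking",
}

def tag_sample(attack_vector):
    """Return both axes so the finer vector stays queryable."""
    if attack_vector not in ARCHETYPE_OF:
        raise ValueError(f"unmapped attack_vector: {attack_vector!r}")
    return {
        "archetype": ARCHETYPE_OF[attack_vector],
        "attack_vector": attack_vector,
    }
```

Keeping both fields on every sample lets results be grouped by archetype for the headline while a supply-chain-only view remains one filter away.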

2.2 Provenance — 3 levels (the credibility axis)

| Provenance | Definition | Role |
| --- | --- | --- |
| wild_verbatim | actual malicious skill file captured from a real repo/marketplace (has corpus-url) | strict, unimpeachable recall |
| organic_authored | written by our tooling, anchored to a real documented campaign (real CVE/IOC/reference), disclosed as authored | real-threat-grounded recall |
| synthetic_novel | novel, defanged variants generated open-weight to exercise the same behavior with unseen surface form; held-out, never published | the generalization probe (§4) |

synthetic_novel is generated on an open-weight model (Modal), never on a commercial API (Principle 2), uses inert placeholders (attacker.invalid, benign-decoding base64, no working payloads — detection fixtures, like EICAR, not weapons), and is kept private (anti-gaming).

2.3 Difficulty tier (orthogonal)

overt / obfuscated / indirect — tracked as a separate axis, reported per tier. v1.1 disclosure: our in-house malicious corpus is overt-dominated, so the in-house tier axis is effectively collapsed and we do not report an in-house overt-vs-indirect split. The independent Skill-Inject set carries the real tier signal (36 overt / 48 indirect), and the headline per-tier results (§2.6) are reported there — which is the harder, more honest place to measure it anyway.

2.4 Benign controls

benign with a dual_use sub-flag (legit Stripe/SSH/MCP skills that should not be flagged).

Dual-use adjudication (stated plainly, including its limits). The boundary is intent + disclosure, not the mere presence of risky operations: a skill that reads a secret and makes a network call for its stated, consented purpose (e.g. a Jira auth skill calling the Jira API) is dual-use-benign; covert exfiltration / undisclosed remote-code-execution is malicious. The dual_use_recovered set was labeled by an independent open-weight design-intent judge (scripts/verify_wild.py, llama-3.3-70b — not a graded scanner, not a board baseline) over heuristic-flagged-but-plausibly-legitimate harvested skills; refusals/parse-failures go to review, never a silent label. This is an automated single-judge process, not human inter-rater adjudication — a known limitation; every dual-use verdict + reason is recorded for human review (WILD_RECOVERY_REVIEW.md). A vendor that disputes a specific dual-use call is disputing this intent-not-operations criterion; the per-sample reasons are auditable.

Second-rater check (the FP X-axis rests on these labels, so we validated them). A random sample of 50 dual_use skills was re-graded by an independent second rater from a different model family (claude-sonnet-4-6, not the llama-3.3-70b labeler), asked malicious-vs-benign cold: agreement was 100% (50/50) — the second rater confirmed every sampled dual-use skill is non-malicious. So a scanner that flags one is a genuine false-positive, not a mislabeled true-positive; the X-axis is sound. Residual caveats: this is cross-model agreement, not human (human grading of the subset remains the gold-standard follow-up), and a frontier model validating labels it is also scored against is not fully independent — but 100% cross-family agreement is strong evidence the labels are not the weak point.

2.5 No in-file labels

Labels live in manifest.jsonl; content is label-stripped and (for organic) de-wrapped of its corpus-metadata frontmatter before any scanner sees it — no leakage.
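The separation of labels from content can be sketched as follows. This is an illustration only: it assumes the corpus-metadata wrapper is a leading `---` delimited block and that each manifest row carries `id` and `label` fields, neither of which is confirmed by this page.

```python
import json

def strip_frontmatter(text):
    """Drop a leading '---' ... '---' block if present; otherwise return text unchanged."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[i + 1:])
    return text  # unterminated block: leave untouched rather than guess

def load_labels(manifest_text):
    """Parse JSONL manifest text into {sample_id: label}; field names are hypothetical."""
    labels = {}
    for line in manifest_text.splitlines():
        if line.strip():
            row = json.loads(line)
            labels[row["id"]] = row["label"]
    return labels

wrapped = "---\nlabel: malicious\narchetype: code_execution\n---\n# Skill\nbody\n"
clean = strip_frontmatter(wrapped)
labels = load_labels('{"id": "s1", "label": "benign"}\n{"id": "s2", "label": "malicious"}\n')
```

The scanner is only ever handed `clean`; the grader joins verdicts back to `labels` by sample id afterwards.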

2.6 Independent validation + generation provenance (the honest measurement)

A corpus we generate is, by construction, not independent: an LLM that helped write the malicious samples can recognize them rather than detect them, which flatters LLM-reading. So the headline recall is measured on a set we did not author — Skill-Inject (arXiv:2602.20156), 84 published malicious cases (36 overt / 48 indirect/contextual injections), reconstructed into runnable multi-file skills and scored unchanged (provenance published_independent, n per tier as stated). The LLM controls read the full multi-file skill — an earlier truncation defect that hid payloads buried at the end of long SKILL.md files was found and fixed; all numbers are on full content. Cisco's per-cell refusals are excluded per §5, with the count-as-misses alternative shown inline.

The finding is monotone in model capability — then it plateaus. Five LLMs were run on Skill-Inject via direct provider APIs (OpenAI and Anthropic), generic de-primed prompt, temp 0, zero refusals. Recall climbs the cost ladder then hits a wall: claude-opus-4.8 (81%, 68/84) is statistically tied with claude-sonnet-4.6 (82%, 69/84) — McNemar exact p≈1.0, agreement 75/84 — so the two best frontier models converge at a ~82% ceiling and go no higher. The ladder: gpt-4o-mini 23%, claude-haiku-4.5 29%, gpt-4o 38%, claude-sonnet-4.6 82% (Wilson floor 73% — above every static/cheap method's ceiling). Static rules sit at 13–32%. So the cheap/local/private options (static + small models) all fail under 40%, and only the flagship frontier models clear the bar — detection quality tracks capability + cost, and "use an LLM" is meaningless without naming the model.
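The two statistics quoted above can be checked with a few lines of standard-library Python. This is a sketch, not the harness code. The discordant split used for McNemar (4 vs 5) is derived here from the stated numbers rather than taken from the harness: 84 − 75 = 9 disagreements, and a one-sample difference in hits (69 vs 68) implies a 5/4 split.

```python
from math import comb, sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

lo, hi = wilson_interval(69, 84)   # claude-sonnet-4.6 on Skill-Inject
p_value = mcnemar_exact(4, 5)      # discordant pairs derived from 75/84 agreement
```

With these inputs the Wilson lower bound rounds to 73% and the exact McNemar p-value is 1.0, matching the figures in the paragraph.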

"Capability" here means capability-as-deployed, not raw reasoning horsepower. The 44-pt gpt-4o→Sonnet gap is very likely not pure parameter count or reading-comprehension: Anthropic's safety-alignment pipeline (Constitutional AI / RLHF) over-indexes on agentic misuse and jailbreak detection, so Claude may be acting partly as a purpose-tuned security classifier where gpt-4o is tuned as a general assistant. We attribute the result to flagship capability combined with safety-alignment weighting, and don't claim to separate the two; both are properties a buyer gets (or doesn't) off the shelf, which is what the board measures.

Contamination note on the winner (and its limits): Claude is one of the models that helped author our in-house organic set, so a Claude score on our own corpus would be suspect — which is why we lead with Skill-Inject, a different group's published benchmark we authored none of. That defeats the our-authorship recognition vector, and no more: we do not claim the 82% is "uncontaminated" in the absolute. Two residual confounds we cannot fully exclude — (a) we can't verify Skill-Inject's authors' own generation pipeline, and (b) a public arXiv benchmark (2602.20156) could in principle have entered a model's pretraining (the §4 memorisation mechanism cuts both ways). What argues the 82% is capability, not recognition: Skill-Inject is built from a fixed set of human-authored injection templates (their published obvious_injections.json (36) + contextual_injections.json (48) — instruction strings with explicit goal/judge metadata, templated unchanged into base skills), not free-form LLM prose, so same-family generation is implausible by construction; and the same inputs span 29% (claude-haiku) → 38% (gpt-4o) → 82% (claude-sonnet) — a 44-pt gap between the cheap models and the flagship that no memorisation story explains. We flag the residual and lead with this set anyway, because it is still the cleanest available: free of the our-authorship contamination that taints every in-house number.

Generation provenance of the in-house set (disclosed in full): organic malicious was authored by our tooling (generate_training_data.py, using gpt-4o / Claude Sonnet / DeepSeek; per-sample model not recorded); defanged synthetic was generated on open-weight models only (mixtral-8x22b, gemma-2-27b, hermes-3, llama-3.3-70b, qwen2.5-72b, deepseek-v3.1 — Principle 2, ~even split). Because the organic set is commercial-LLM-authored, the frontier baselines are not contamination-clean on it — so we do not headline the (higher) in-house LLM recall; the independent Skill-Inject number is the ceiling we stand behind. Cross-family scoring + the generated-benign control mitigate same-family recognition for the open-weight baselines, but do not fully clean the frontier-on-organic case — which is exactly why the independent set, not the in-house one, is the headline.

3 Scanners & modes

Inclusion rule — the access bar. We grade what a normal developer can actually run: free or cheap, self-serve, no enterprise sales call, no big spend. If Jenny or Johnny OpenClaw can run it without contacting sales, it's eligible. Tools that are sales-gated or have no self-serve API we cite but cannot benchmark (you can't script a captcha'd web form). License isn't the criterion — accessibility is; a free self-serve commercial tier qualifies.

Graded (pinned; versions feed the cache key):

Cite-and-flag (acknowledged, not benchmarkable here — no self-serve API): ESET AI Skills Checker (web-form only), Mitiga Skillgate (account-gated), and closed/hosted SandyClaw / Mobb. Published-benchmark references (BIV, SkillSieve) are cited in §8. Tested data is never discarded — anything we ran is shown, labeled by access tier.

+llm backend — what we actually ran, and why it changed. Our intent was apples-to-apples on our open-weight Modal endpoint (Principle-2-clean). That backend could not run the products' own +llm modes (§3b: Cisco's required json_schema → HTTP500 on the AWQ endpoint; SkillSpector's multi-call path timed out on ~99% of a corpus-scale run). The only backend that ran both products' +llm to completion was a direct frontier API (gpt-4o, temp 0) — so the Product LLM modes rows are reported on gpt-4o direct and labeled as such on the board, not on the open-weight endpoint. The LLM-read baselines (Qwen-72B in-set, phi-4 disjoint) are the open-weight controls; Snyk runs on its own cloud LLM. We never silently mix backends: every LLM row names its backend. The backend itself is a finding (§3b), and running both products against the same gpt-4o isolates the integration (prompt/wiring) as the variable — which is exactly where they diverge (~83% vs ~4% known-recall).

3a LLM-read baseline — exact prompt & method (open for scrutiny)

The baseline is a deliberately cheap, one-shot control: hand the model the skill text and ask one question. It is not a product and not a recommendation — it is a floor on "what does a single LLM read catch?" The exact, unedited method:

Why generic, not primed (a confound we found and removed). An earlier prompt named our three archetypes (credential-exfil / RCE / prompt-injection). An ablation (scripts/baseline_prompt_ablation.py, phi-4, n=90/group) showed that priming inflated recall by 29pp (89% primed vs 60% generic) and FP too (17% vs 1%) — the named taxonomy primed the model with the corpus's answer space, an advantage no general product is given. We therefore use the generic prompt as canonical. Consequences, stated plainly:

  1. The "baseline beats Snyk" claim does not survive de-priming — generic phi-4 (~60% recall / ~1% FP) does not out-recall Snyk (72–81%); it sits at a lower-FP operating point. We do not claim the baseline beats the product.
  2. The "LLM-read beats static rule scanners" finding is robust — generic ~60% still exceeds SkillSpector/Cisco (~6–8%) by roughly 7.5–10×.
  3. The baseline is a control / floor, never a deployment recommendation. Read recall with the FP column; we never rank on recall alone.

The prompt, decode rule, models, the primed-vs-generic ablation, and per-sample verdicts are all in the open-source harness (scoreboard/adapters/llm_baseline.py, scripts/baseline_prompt_ablation.py) for independent audit.

Independent-set priming ablation (verbatim prompts; the "17%" result). On the frontier independent runs we also tried an injection-aware system prompt. Both are published unedited (scripts/score_skillinject_llm.py):

With the primed prompt, gpt-4o's pooled recall fell (38% → 17%). We do not read this as "prompting can't help"; the honest reading is narrower: standard adversarial priming degraded performance on this task — most plausibly safety-filter / over-cautious-classification over-triggering on the longer risk-laden instruction, pushing borderline calls toward refusal-shaped or BENIGN-default outputs. The defensible claim is that out-of-the-box reasoning capability, not prompt engineering, is the primary differentiator here — a hand-tuned prompt per model might recover some of it, but that is a per-model engineering effort no off-the-shelf deployment gets for free. We report the de-primed generic number as canonical and flag the primed result as an ablation, not the headline.

3b Backend availability — an operational caveat (a finding in itself)

LLM-backed scanning needs an LLM backend, and every backend option has friction for malicious-content analysis — which makes these tools genuinely hard to run in practice:

The finding: a scanner's "LLM mode" is only as deployable as the backend you can provision for exactly the malicious content it must analyze — and none of the three routes is friction-free. A vendor shipping an LLM mode inherits this. We report each scanner's +llm results under the backend we could actually run, and label it.

4 The headline metric — Generalization Gap

For each scanner × archetype:

Gen-Gap = recall(wild_verbatim + organic_authored) − recall(synthetic_novel)

Wild/organic attacks are public (known campaigns, CVEs) → a scanner can score high by memorizing IOCs. synthetic_novel is unseen and defanged (no known IOC to match) → it isolates behavioral detection. A large positive gap = memorization, not detection. This is the scoreboard's core, field-level finding ("prove or disprove the category").
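The Gen-Gap formula can be sketched per scanner × archetype cell as below. The record fields (`scanner`, `archetype`, `provenance`, `detected`) are hypothetical, and pooling wild_verbatim with organic_authored into one "known" recall is this example's reading of the formula, not a confirmed detail of the harness.

```python
from collections import defaultdict

KNOWN = {"wild_verbatim", "organic_authored"}

def gen_gap(records):
    """Return {(scanner, archetype): known_recall - novel_recall}.

    Only malicious samples should be passed in; benign controls have no recall.
    Cells missing either side are skipped rather than reported as 0.
    """
    # Per cell: [known_hits, known_n, novel_hits, novel_n]
    cells = defaultdict(lambda: [0, 0, 0, 0])
    for r in records:
        cell = cells[(r["scanner"], r["archetype"])]
        hit = int(bool(r["detected"]))
        if r["provenance"] in KNOWN:
            cell[0] += hit
            cell[1] += 1
        elif r["provenance"] == "synthetic_novel":
            cell[2] += hit
            cell[3] += 1
    return {
        key: kh / kn - nh / nn
        for key, (kh, kn, nh, nn) in cells.items()
        if kn and nn
    }

rows = (
    [{"scanner": "A", "archetype": "code_execution",
      "provenance": "wild_verbatim", "detected": d} for d in (True, True)]
    + [{"scanner": "A", "archetype": "code_execution",
        "provenance": "organic_authored", "detected": d} for d in (True, False)]
    + [{"scanner": "A", "archetype": "code_execution",
        "provenance": "synthetic_novel", "detected": d} for d in (True, False)]
)
gaps = gen_gap(rows)
```

In this toy cell the scanner catches 3 of 4 known samples but 1 of 2 novel ones, a gap of 0.25; a scanner that detects behavior rather than known indicators would sit near zero.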

What a known≫novel gap (the gen-gap "verdict" column on the board) means depends on the scanner (read it carefully):

5 Statistics (the rigor)

6 Outputs (digestible, in order)

  1. Recall-vs-FP frontier scatter (the in-house recall/FP trade-off; the independent winner, free of our-authorship contamination, is on the Skill-Inject set, §2.6).
  2. Generalization-gap chart (wild/organic recall vs synthetic recall, per scanner).
  3. Per-scanner report cards — archetype × tier matrix, each cell recall [Wilson CI], plus mode-lift (static→+llm) and the refusal column.

7 Reproducibility & integrity

Pinned scanner versions + corpus version + run date stamped on every result. Methodology + aggregate results are public; the malicious corpus (esp. synthetic_novel) stays private (anti-gaming). Vendor neutrality: no preferential treatment, no right of reply, no methodology accommodation; an optional vendor note may be posted at our discretion (never alters scoring); corrections (factual/method errors) accepted from anyone. /reproduce ships commands + the pinned versions.
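To make "pinned versions feed the cache key" concrete, here is a minimal sketch of how such a key could be derived. The function name, the field set, and the use of SHA-256 are all assumptions for illustration; the harness may key its cache differently.

```python
import hashlib
import json

def cache_key(scanner, version, mode, backend, corpus_version, content):
    """Derive a deterministic key; changing any pinned input yields a new key."""
    payload = {
        "scanner": scanner,
        "version": version,
        "mode": mode,              # e.g. "static" or "+llm"
        "backend": backend,        # named explicitly so backends are never mixed
        "corpus_version": corpus_version,
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

k1 = cache_key("example-scanner", "1.2.0", "static", "none", "v1", "skill text")
k2 = cache_key("example-scanner", "1.2.1", "static", "none", "v1", "skill text")
```

Because the scanner version is part of the hashed payload, bumping it (as in `k2`) cannot reuse a verdict cached under the older version.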

This scoreboard measures detection; it builds on independent work that characterises the threat. Most directly, Liu et al., "Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild (USENIX Security 2026; arXiv:2602.06547) analysed 98,380 skills across two registries and confirmed 157 malicious ones (~0.16%) spanning 13 techniques, dominated by credential-theft-via-RCE and agent-manipulation-via-adversarial-instructions. We lean on and corroborate that work three ways:

Two detection-side benchmarks we cite but do not re-run (cited from §3):

Pointers to other prior art are welcome (corrections policy, §7).