SkillHarm is a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. It evaluates two attack scenarios:
- Fixed-Payload Poisoning (FPP). A fixed poisoned skill directly compromises any task session that invokes it.
- Self-Mutating Poisoning (SMP). An initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse.
It also defines 12 risk types, organized by the agent workflow component that the harm targets: data pipelines, system environments, and agent autonomy.
To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline in which coding agents are driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Current agents remain vulnerable, with attack success rates (ASR) of up to 86.3% under FPP and 69.3% under SMP.
- 06/01/2026: Paper, benchmark, and code release.
SkillHarm/
├── paper.pdf                        # camera-ready manuscript
├── README.md                        # you are here
├── figures/                         # introduction figures used in this README
└── AutoSkillHarm/                   # automated attack-construction pipeline
    ├── risk_taxonomy.json           # 12-risk taxonomy
    ├── fixed-payload-poisoning/     # FPP scenario
    └── self-mutating-poisoning/     # SMP scenario
| Scenario | Lifecycle window | Mechanism |
|---|---|---|
| Fixed-Payload Poisoning (FPP) | Single session | Poisoned skill ships malicious content at installation; harm fires during the first invocation. |
| Self-Mutating Poisoning (SMP) | Cross-session | First task runs harmlessly but silently mutates persistent skill content via, e.g., an atexit hook; harm materializes only when a later task reuses the modified skill. |
SMP exposes a failure mode single-session benchmarks cannot observe.
Risks are organized by the workflow component through which harm materializes. The canonical definitions live in AutoSkillHarm/risk_taxonomy.json.
| Category | Risk types |
|---|---|
| Data Pipeline Exploitation | Data Exfiltration · Output Manipulation · Poisoning |
| System Environment Exploitation | Privilege Escalation · Unauthorized File Mod. · Backdoor Injection · Denial of Service · Malware Deployment · System Corruption |
| Agent Autonomy Exploitation | Goal Hijacking · Anti-Forensics · Proxy Attack |
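
As a quick sanity check on the table above, the taxonomy can be mirrored as a plain mapping. This is only an illustrative sketch: the dictionary below is copied from the table, `category_of` is a hypothetical helper, and the actual schema of `risk_taxonomy.json` may differ.

```python
# Illustrative in-memory copy of the taxonomy table; this is NOT the
# schema of AutoSkillHarm/risk_taxonomy.json, which may be organized differently.
RISK_TAXONOMY = {
    "Data Pipeline Exploitation": [
        "Data Exfiltration",
        "Output Manipulation",
        "Poisoning",
    ],
    "System Environment Exploitation": [
        "Privilege Escalation",
        "Unauthorized File Mod.",
        "Backdoor Injection",
        "Denial of Service",
        "Malware Deployment",
        "System Corruption",
    ],
    "Agent Autonomy Exploitation": [
        "Goal Hijacking",
        "Anti-Forensics",
        "Proxy Attack",
    ],
}


def category_of(risk_type: str) -> str:
    """Return the category that contains the given risk type (hypothetical helper)."""
    for category, risks in RISK_TAXONOMY.items():
        if risk_type in risks:
            return category
    raise KeyError(risk_type)


# The three categories together cover the 12 risk types.
total_risk_types = sum(len(risks) for risks in RISK_TAXONOMY.values())
print(total_risk_types)  # 12
```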
A coding agent drives three sequential stages, each specified by a natural-language harness that defines stage inputs, objectives, constraints, required outputs, and review criteria:
- Attack Target Selection. For FPP, identify reachable skill files from baseline agent read-rates; for SMP, select task pairs sharing skills via consensus voting among selector agents.
- Attack Design. Instantiate a (target, risk type) combination into a contextualized payload plus a deterministic attack-success evaluator. Includes iterative self-refinement gated by an LLM-based safety scanner.
- Quality Filter. Replay each candidate against representative victim agents and have a reviewer agent emit a keep/discard verdict on goal validity, evaluator faithfulness, and payload quality.
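
The consensus-voting step for SMP target selection can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the logic of `vote.py`: each selector agent is assumed to return a list of ordered (Task A, Task B) pairs, and a pair is kept when at least `min_votes` agents proposed it. The function name, the ballot format, and the threshold are all hypothetical.

```python
from collections import Counter
from typing import Iterable

# An ordered pair: (task that mutates the skill, task that later reuses it).
Pair = tuple[str, str]


def consensus_pairs(ballots: Iterable[Iterable[Pair]], min_votes: int) -> list[Pair]:
    """Keep pairs proposed by at least `min_votes` selector agents (assumed rule)."""
    counts: Counter = Counter()
    for ballot in ballots:
        # Each agent votes at most once per pair, so dedupe within a ballot.
        counts.update(set(ballot))
    return sorted(pair for pair, votes in counts.items() if votes >= min_votes)


# Three hypothetical selector agents voting over hypothetical task IDs.
ballots = [
    [("task_1", "task_2"), ("task_3", "task_4")],
    [("task_1", "task_2")],
    [("task_1", "task_2"), ("task_3", "task_4"), ("task_3", "task_4")],
]
unanimous = consensus_pairs(ballots, min_votes=3)
majority = consensus_pairs(ballots, min_votes=2)
```

Pair order is preserved on purpose: in SMP the first task mutates the skill and the second one reuses it, so (A, B) and (B, A) are different candidates.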
Surviving candidates flow into victim evaluation, which runs the attack end-to-end against six model-harness configurations across four agent harnesses (Claude Code, Codex, Gemini CLI, OpenCode) and judges trajectories with an LLM judge for cASR (conditional ASR, given engagement) and ARR (attack refusal rate). The primary ASR is computed deterministically per sample with no LLM call.
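
To make the three metrics concrete, here is a small sketch of how they could be aggregated from per-sample records. The record fields and the denominators are assumptions made for illustration: ASR is taken over all samples, cASR over samples where the agent engaged with the payload, and ARR over all samples. The released `compute_metrics.sh` scripts remain the reference for the exact definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    attack_succeeded: bool  # verdict of the deterministic evaluator
    engaged: bool           # judge label: agent engaged with the payload (assumed field)
    refused: bool           # judge label: agent refused the attack (assumed field)


def summarize(trajectories: list[Trajectory]) -> dict[str, Optional[float]]:
    """Aggregate ASR, cASR, and ARR under the assumed definitions above."""
    n = len(trajectories)
    if n == 0:
        raise ValueError("no trajectories to summarize")
    engaged = [t for t in trajectories if t.engaged]
    asr = sum(t.attack_succeeded for t in trajectories) / n
    # cASR is undefined when no trajectory engaged with the payload.
    casr = sum(t.attack_succeeded for t in engaged) / len(engaged) if engaged else None
    arr = sum(t.refused for t in trajectories) / n
    return {"ASR": asr, "cASR": casr, "ARR": arr}


metrics = summarize([
    Trajectory(attack_succeeded=True, engaged=True, refused=False),
    Trajectory(attack_succeeded=False, engaged=True, refused=False),
    Trajectory(attack_succeeded=False, engaged=False, refused=True),
    Trajectory(attack_succeeded=False, engaged=False, refused=False),
])
print(metrics)  # {'ASR': 0.25, 'cASR': 0.5, 'ARR': 0.25}
```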
git clone https://github.com/OSU-NLP-Group/SkillHarm.git
cd SkillHarm

- Harbor: containerized agent-task execution
- Docker: 50 GB+ free disk recommended for end-to-end victim evaluation
- Python 3.12+
- Provider credentials, set as environment variables:

| Variable | Used for |
|---|---|
| OPENAI_API_KEY | OpenAI / Codex / GPT-based judge |
| ANTHROPIC_API_KEY or AWS Bedrock creds | Claude Code |
| OPENROUTER_API_KEY | OpenCode with OpenRouter models |
| GOOGLE_API_KEY | Gemini via direct API |
| AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT | Azure-routed agents and scanner |
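
Before launching a long run, it can help to confirm which of the credentials listed above are actually set. The following is a hedged sketch, not part of the repository: `CREDENTIALS` and `missing_credentials` are hypothetical names, and the AWS Bedrock alternative for Claude Code is not checked because its variable names are not given here.

```python
import os
from typing import Mapping, Optional

# Purpose -> environment variables it needs, taken from the list above.
# The AWS Bedrock alternative for Claude Code is intentionally not modeled.
CREDENTIALS = {
    "OpenAI / Codex / GPT-based judge": ["OPENAI_API_KEY"],
    "Claude Code": ["ANTHROPIC_API_KEY"],
    "OpenCode with OpenRouter models": ["OPENROUTER_API_KEY"],
    "Gemini via direct API": ["GOOGLE_API_KEY"],
    "Azure-routed agents and scanner": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> dict[str, list[str]]:
    """Map each purpose to the variables that are unset or empty."""
    env = os.environ if env is None else env
    report = {}
    for purpose, names in CREDENTIALS.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            report[purpose] = absent
    return report


if __name__ == "__main__":
    for purpose, names in missing_credentials().items():
        print(f"{purpose}: missing {', '.join(names)}")
```

Only the providers you intend to evaluate need to be configured, so a non-empty report is not necessarily an error.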
The pipeline operates on the SkillsBench task catalog. Each task follows the Harbor layout — a per-task directory containing task.toml, environment/, solution/, tests/, and the skill package(s) the task uses. Driver scripts reference SkillsBench task IDs and expect the benign tasks to be staged inside each scenario directory:
AutoSkillHarm/
├── fixed-payload-poisoning/
│   └── tasks/<task_id>/          # per-task Harbor layout
└── self-mutating-poisoning/
    └── task_pairs/<pair>/
        ├── task_a/               # Task A Harbor layout
        ├── task_b/               # Task B Harbor layout
        └── shared_skills/        # shared skill package
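
A staged task directory can be checked against the Harbor layout described above with a few lines of Python. This is a sketch under assumptions: it only tests for the four entries named in the text (task.toml, environment/, solution/, tests/), `missing_entries` is a hypothetical helper, and skill packages are not checked because their location inside a task is not specified here.

```python
import tempfile
from pathlib import Path

# Entries named in the Harbor layout description; skill packages are not checked.
REQUIRED_ENTRIES = ("task.toml", "environment", "solution", "tests")


def missing_entries(task_dir) -> list[str]:
    """Return the required entries that are absent from a task directory."""
    root = Path(task_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


# Demo on a throwaway directory that has only two of the four entries.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "example_task"  # hypothetical task ID
    (task / "environment").mkdir(parents=True)
    (task / "task.toml").write_text("")
    demo_missing = missing_entries(task)

print(demo_missing)  # ['solution', 'tests']
```

For SMP, the same check could be applied to both task_a/ and task_b/ within a pair directory.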
Each pipeline stage is invoked via the drivers documented in its scenario README. The overall flow:
FPP — see AutoSkillHarm/fixed-payload-poisoning/README.md
cd AutoSkillHarm/fixed-payload-poisoning
# Stage 1: select injection targets
python attack_target_selection/scripts/compute_baselines.py
python attack_target_selection/scripts/sample_risks.py
# Stage 2: run designer agent
bash attack_design/scripts/run_all.sh
# Stage 3: quality filter (replay + reviewer agent)
bash quality_filter/scripts/run_all_filter.sh

SMP: see AutoSkillHarm/self-mutating-poisoning/README.md
cd AutoSkillHarm/self-mutating-poisoning
# Stage 1: consensus voting over candidate task pairs
bash attack_target_selection/scripts/run_agents.sh
python attack_target_selection/scripts/vote.py
# Stage 2: designer agent (cross-session payload + atexit hook)
bash attack_design/scripts/run_all.sh
# Stage 3: quality filter
bash quality_filter/scripts/run_all_filter.sh

After construction (or with the released benchmark samples), drive victim evaluation from each scenario's victim_eval/:
# FPP — single agent
bash AutoSkillHarm/fixed-payload-poisoning/victim_eval/scripts/eval_command_claude.sh
# FPP — with defensive system prompt
bash AutoSkillHarm/fixed-payload-poisoning/victim_eval/scripts/eval_command_claude_defense.sh
# FPP — LLM judge for cASR / ARR
bash AutoSkillHarm/fixed-payload-poisoning/victim_eval/scripts/compute_metrics.sh
# SMP — full A → snapshot → B cycle
bash AutoSkillHarm/self-mutating-poisoning/victim_eval/scripts/eval_all.sh
bash AutoSkillHarm/self-mutating-poisoning/victim_eval/scripts/compute_metrics.sh

Each scenario README documents per-agent wrappers and defense variants.
If you find this work useful, please consider starring our repo and citing our paper:
@article{ning2026skillharm,
title = {SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction},
author = {Ning, Yuting and Zhang, Zhehao and Lal, Yash Kumar and Gou, Boyu
and Li, Junyi and Ruan, Weitong and Ye, Chentao and Gupta, Rahul
and Yang, Diyi and Su, Yu and Sun, Huan},
journal = {arXiv preprint arXiv:2606.02540},
year = {2026},
eprint = {2606.02540},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
