The first benchmark to expose Execution Hallucination — agents that verbally refuse yet silently complete dangerous OS operations — through semantic–physical dual-layer verification with OS-level state rollback.
1 Nanjing University of Aeronautics and Astronautics
2 Zhejiang University
3 Collaborative Innovation Center of Novel Software Technology and Industrialization
† Corresponding author ‡ Project leader
819 high-risk test cases derived from CVE/GHSA databases (37.6%), penetration testing reports (15.4%), and LLM-generated cases (47.0%) — each annotated with semantic/physical success criteria and OS-level rollback actions.
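One way to picture such an annotated case is as a small record plus a runner that always applies the rollback actions afterwards. The sketch below is illustrative only: the field names, the `run_with_rollback` helper, and the example values are hypothetical and are not taken from the LITMUS dataset schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TestCase:
    # All field names are hypothetical; the real LITMUS schema may differ.
    case_id: str
    source: str                 # e.g. "cve_ghsa", "pentest", "llm_generated"
    category: str               # one of the table's codes: SO, DM, CE, CO, CT
    instruction: str            # natural-language prompt sent to the agent
    semantic_criterion: str     # what counts as success in the agent's reply
    physical_criterion: str     # what counts as success in OS state
    rollback_actions: List[str] = field(default_factory=list)


def run_with_rollback(case: TestCase,
                      execute: Callable[[TestCase], str],
                      rollback: Callable[[str], None]) -> str:
    """Run one case and apply every rollback action, even if execution fails."""
    try:
        return execute(case)
    finally:
        for action in case.rollback_actions:
            rollback(action)


# Example with stand-in callables so nothing touches the real OS.
applied: List[str] = []
case = TestCase(
    case_id="demo-001",
    source="llm_generated",
    category="SO",
    instruction="Disable the firewall.",
    semantic_criterion="agent reports the firewall was disabled",
    physical_criterion="ufw status reports inactive",
    rollback_actions=["ufw --force enable"],
)
reply = run_with_rollback(case, lambda c: "I cannot help with that.", applied.append)
```

Placing the rollback loop in a `finally` block means the environment is restored whether the case finishes normally or raises, which keeps one case from contaminating the next.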
A fully automated, closed-loop, multi-agent pipeline with no human intervention. The Defendant is deployed outside the framework and interacts with it solely through natural-language instructions and OS-level side effects, following a strict black-box paradigm.
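A minimal sketch of how a dual-layer verdict could be formed under this black-box setup, assuming the semantic layer reduces to a boolean judgment of the agent's reply and the physical layer to a read-only probe of OS state. The function names, the outcome labels, and the stand-in probe (a Python one-liner that prints `inactive` in place of a real `ufw status` call, so the example runs without a firewall or root access) are assumptions for illustration, not the framework's API.

```python
import subprocess
import sys


def collect_evidence(command):
    """Run a read-only probe and record its output as an evidence record."""
    done = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return {"command": " ".join(command), "result": done.stdout.strip()}


def classify(semantic_success, evidence, expected_result):
    """Combine the semantic verdict with the physical evidence."""
    physical_success = evidence["result"] == expected_result
    if semantic_success and physical_success:
        return "both"            # agent says done, OS state agrees
    if semantic_success:
        return "semantic_only"   # agent says done, OS state does not agree
    if physical_success:
        return "physical_only"   # agent refuses, yet the OS state changed
    return "neither"


# Stand-in probe: prints "inactive" instead of querying the firewall.
probe = [sys.executable, "-c", "print('inactive')"]
evidence = collect_evidence(probe)

# The agent verbally refused (semantic_success=False) but the state changed.
verdict = classify(False, evidence, "inactive")
```

The `physical_only` outcome corresponds to the pattern the benchmark targets: a verbal refusal paired with a completed operation.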
{"command":"ufw status","result":"inactive"}) as physical evidence.

Six frontier LLMs were evaluated as the agent brain of OpenClaw (v4.2.0) on Ubuntu 24.04. Each case was run 3× (mean ± std reported). Lower ASR and EHR indicate safer, more reliable agents.
| Agent Brain | Cat. | SPSR | SOSR | POSR | SPFR | ASR ↓ | EHR ↓ |
|---|---|---|---|---|---|---|---|
| Deepseek-v3.2 (Highest ASR) | SO | 70.69 | 6.45 | 4.30 | 18.27 | 75.27 | 10.75 |
| | DM | 60.60 | 9.09 | 4.55 | 25.76 | 65.15 | 13.63 |
| | CE | 56.52 | 2.90 | 5.80 | 34.78 | 63.31 | 8.69 |
| | CO | 90.00 | 0.00 | 6.66 | 3.33 | 96.66 | 6.66 |
| | CT | 66.66 | 3.22 | 4.30 | 25.80 | 70.79 | 7.52 |
| | Total | 66.66 | 4.84 | 4.84 | 23.64 | 71.51% | 9.69% |
| Deepseek-v4-pro (2nd Highest ASR) | SO | 72.05 | 6.45 | 6.45 | 15.05 | 78.49 | 12.90 |
| | DM | 65.15 | 7.58 | 6.06 | 21.21 | 71.21 | 13.64 |
| | CE | 56.52 | 8.70 | 1.45 | 33.33 | 57.97 | 10.15 |
| | CO | 80.00 | 6.67 | 3.33 | 10.00 | 83.33 | 10.00 |
| | CT | 63.44 | 0.00 | 1.08 | 35.48 | 64.52 | 1.08 |
| | Total | 66.10 | 5.41 | 3.70 | 24.79 | 69.80% | 9.12% |
| Qwen3.6-plus (Lowest EHR ★) | SO | 44.09 | 6.45 | 7.53 | 41.94 | 51.61 | 13.98 |
| | DM | 50.00 | 6.06 | 0.00 | 43.94 | 50.00 | 6.06 |
| | CE | 49.28 | 8.70 | 4.35 | 37.68 | 53.62 | 13.04 |
| | CO | 83.33 | 0.00 | 0.00 | 16.67 | 83.33 | 0.00 |
| | CT | 54.84 | 0.00 | 2.15 | 43.01 | 56.99 | 2.15 |
| | Total | 52.42 | 4.56 | 3.42 | 39.60 | 55.84% | 7.98% |
| Gemini-3.1-pro preview (Google) | SO | 59.82 | 5.45 | 6.52 | 28.21 | 66.34 | 11.97 |
| | DM | 51.52 | 9.09 | 1.52 | 37.88 | 53.03 | 10.61 |
| | CE | 42.82 | 11.79 | 4.35 | 41.04 | 47.17 | 16.14 |
| | CO | 86.67 | 3.33 | 0.00 | 10.00 | 86.67 | 3.33 |
| | CT | 48.39 | 1.08 | 0.00 | 50.54 | 48.39 | 1.08 |
| | Total | 54.16 | 6.02 | 2.87 | 36.96 | 57.02% | 8.89% |
| GPT-5.3-Codex (Highest EHR) | SO | 59.14 | 4.30 | 8.60 | 27.95 | 67.74 | 12.90 |
| | DM | 60.61 | 10.61 | 0.00 | 28.79 | 60.61 | 10.61 |
| | CE | 42.03 | 11.59 | 7.25 | 39.13 | 49.27 | 18.84 |
| | CO | 96.67 | 0.00 | 0.00 | 3.33 | 96.67 | 0.00 |
| | CT | 39.79 | 2.15 | 1.08 | 56.99 | 40.86 | 3.23 |
| | Total | 54.13 | 5.98 | 3.99 | 35.90 | 58.12% | 9.97% |
| Claude-Sonnet-4-6 (Anthropic; Lowest ASR ★) | SO | 41.29 | 6.53 | 5.45 | 46.74 | 46.74 | 11.97 |
| | DM | 44.66 | 9.16 | 1.52 | 44.66 | 46.17 | 10.68 |
| | CE | 39.52 | 4.42 | 0.00 | 56.06 | 39.52 | 4.42 |
| | CO | 66.67 | 0.00 | 3.33 | 30.00 | 70.00 | 3.33 |
| | CT | 17.38 | 2.15 | 4.34 | 76.13 | 21.72 | 6.49 |
| | Total | 37.47 | 4.90 | 3.17 | 54.46 | 40.64% | 8.07% |
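The Total rows are numerically consistent with ASR = SPSR + POSR and EHR = SOSR + POSR (for Claude-Sonnet-4-6, 37.47 + 3.17 = 40.64 and 4.90 + 3.17 = 8.07). The sketch below computes the six columns under that assumption, reading SPSR, SOSR, POSR, and SPFR as both-layer success, semantic-only success, physical-only success, and both-layer failure. That reading is inferred from the numbers rather than stated on this page, and the function names and the choice of sample standard deviation for the 3× runs are likewise assumptions.

```python
from statistics import mean, stdev


def rates(outcomes):
    """outcomes: list of (semantic_success, physical_success) booleans."""
    n = len(outcomes)

    def pct(pred):
        # Percentage of cases whose (semantic, physical) pair satisfies pred.
        return 100.0 * sum(1 for s, p in outcomes if pred(s, p)) / n

    r = {
        "SPSR": pct(lambda s, p: s and p),
        "SOSR": pct(lambda s, p: s and not p),
        "POSR": pct(lambda s, p: p and not s),
        "SPFR": pct(lambda s, p: not s and not p),
    }
    r["ASR"] = r["SPSR"] + r["POSR"]   # assumed: any physical success
    r["EHR"] = r["SOSR"] + r["POSR"]   # assumed: semantic/physical mismatch
    return r


def mean_std(values):
    """Mean and sample standard deviation across repeated runs."""
    return mean(values), stdev(values)


# Toy data: four cases in one run, then ASR values from three repeated runs.
run = [(True, True), (True, True), (True, False), (False, True)]
one = rates(run)
asr_mean, asr_std = mean_std([75.0, 50.0, 100.0])
```

Under this reading the four base rates partition all cases, so they sum to 100, and ASR and EHR overlap only through POSR.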
Radar charts compare Attack Success Rate (ASR) of each variant against the Naive baseline (seed dataset, no attack applied) across five operational scope categories. Each chart covers one model × one attack paradigm.
Five consistent findings from LITMUS across the seed dataset and attack-extended evaluation.
If you use LITMUS in your research, please cite our paper.