DeFiSect
EVMbench: OpenAI and Paradigm Launch AI Smart Contract Security Benchmark — GPT-5.3 Exploits 72% of Critical Bugs

OpenAI and Paradigm release EVMbench, an open benchmark revealing GPT-5.3-Codex exploits 72% of real smart contract vulnerabilities, more than doubling the score of the six-month-old GPT-5 and marking a 3.5x leap over earlier models.

Lena Vogt · 8 min read

In February 2026, the AI and blockchain security communities received a signal that is difficult to ignore: a general-purpose AI model, running through a command-line interface (CLI), can now autonomously craft working exploits against more than 70% of real-world smart contract vulnerabilities tested under controlled conditions. The system behind this result is EVMbench — a new open-source benchmark released jointly by OpenAI, crypto investment firm Paradigm, and blockchain security firm OtterSec.

The benchmark arrives at a moment when AI agent capabilities are advancing faster than the security infrastructure designed to counter them. For DeFi protocols managing billions in total value locked (TVL), the implications are immediate.

What Is EVMbench and Who Built It

EVMbench is an open-source benchmark designed to measure how well AI agents can handle real-world smart contract security tasks. It was released publicly in February 2026 by OpenAI in collaboration with Paradigm and OtterSec, and the full benchmark — including tasks, harness tooling, and documentation — is freely available on GitHub.

The dataset underlying the benchmark is what gives it credibility. EVMbench is built from 120 curated high-severity vulnerabilities drawn from 40 professional blockchain security audits, sourced primarily from open Code4rena audit competitions and Paradigm's own Tempo audit process. These are not synthetic or toy examples — they represent real vulnerabilities that have been evaluated by professional auditors in competitive, high-stakes environments.

The explicit goal of EVMbench is to serve as a shared, reproducible measurement standard so that the security community can consistently track how AI agent offensive and defensive capabilities evolve over time. As AI models improve, a stable benchmark allows researchers and defenders to observe the offense-defense balance in a systematic way, rather than relying on anecdotal comparisons.

Three Evaluation Modes: Detect, Patch, Exploit

EVMbench evaluates AI agents across three structurally distinct task modes, each probing a different dimension of security competence.

Detect is the most familiar to anyone who has worked with automated code analysis tools. In this mode, an AI agent audits a smart contract repository and is scored based on its ability to identify known vulnerabilities. The scoring reflects both recall — how many vulnerabilities the agent finds — and the relevance of flagged findings, weighted to reflect the audit rewards associated with each issue.
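
The reward-weighted recall idea behind Detect scoring can be sketched as follows. This is a hypothetical illustration, not EVMbench's actual scoring code; the vulnerability IDs and reward figures are invented.

```python
# Hypothetical sketch of reward-weighted Detect scoring: recall where
# each vulnerability counts in proportion to its audit reward.

def detect_score(findings, known_vulns):
    """findings: set of vulnerability IDs the agent flagged.
    known_vulns: dict mapping vulnerability ID -> audit reward (USD)."""
    total_reward = sum(known_vulns.values())
    if total_reward == 0:
        return 0.0
    found_reward = sum(
        reward for vuln_id, reward in known_vulns.items() if vuln_id in findings
    )
    # Finding a $50k critical issue moves the score far more than a $5k one.
    return found_reward / total_reward

vulns = {"reentrancy-01": 50_000, "rounding-02": 5_000, "access-03": 20_000}
print(detect_score({"reentrancy-01", "access-03"}, vulns))  # ≈ 0.93
```

Under a weighting like this, an agent that misses only low-reward issues still scores near 1.0, which matches the stated goal of reflecting audit rewards rather than raw finding counts.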

Patch moves beyond identification to remediation. The agent must modify a vulnerable contract to eliminate the flaw while preserving the contract's intended functionality. Correctness is verified through automated test suites, meaning the agent cannot simply delete code — it must understand the vulnerability well enough to fix it properly.
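
The pass condition described above can be sketched as a two-part check: the functional test suite must still pass, and the known exploit must no longer work. Everything below is a toy illustration; `vulnerable_withdraw`, `patched_withdraw`, and the test and exploit callables are invented stand-ins, not EVMbench code.

```python
# Hypothetical Patch-mode gate: a patch is accepted only if intended
# behavior is preserved AND the known exploit no longer succeeds.

def patch_accepted(contract, functional_tests, exploit):
    functionality_preserved = all(test(contract) for test in functional_tests)
    vulnerability_fixed = not exploit(contract)
    return functionality_preserved and vulnerability_fixed

# Toy "contract": a withdraw function that must never pay out more
# than the caller's balance.
def vulnerable_withdraw(balance, amount):
    return amount                    # bug: ignores the balance cap

def patched_withdraw(balance, amount):
    return min(amount, balance)      # fix: cap payout at the balance

tests = [lambda w: w(100, 40) == 40]       # normal withdrawal still works
exploit = lambda w: w(100, 1_000) > 100    # can we over-withdraw?

print(patch_accepted(vulnerable_withdraw, tests, exploit))  # False
print(patch_accepted(patched_withdraw, tests, exploit))     # True
```

Note how simply deleting the withdraw logic would fail the functional tests, which is exactly why an agent cannot pass this mode by removing code.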

Exploit is where EVMbench diverges most sharply from conventional static analysis benchmarks. In this mode, the agent must craft a working, end-to-end exploit against a live contract instance running in a sandboxed Ethereum Virtual Machine (EVM) environment. The exploit must successfully extract value or trigger the unintended behavior — partial solutions do not count. This is the mode that most directly mirrors what a malicious actor — human or AI — would actually do in the wild.
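
The binary success condition in Exploit mode can be illustrated with a minimal check. A real harness would run the agent's exploit against the live contract instance in the sandboxed EVM; here a plain in-memory balance map stands in for chain state, purely to show that partial progress scores nothing.

```python
# Hypothetical end-to-end success check: the exploit counts only if it
# actually moved value to the attacker. Corrupting state without
# extracting value scores zero.

def exploit_succeeded(balances_before, balances_after, attacker):
    return balances_after.get(attacker, 0) > balances_before.get(attacker, 0)

before = {"attacker": 0, "vault": 1_000_000}
after = {"attacker": 1_000_000, "vault": 0}   # vault drained in the sandbox
print(exploit_succeeded(before, after, "attacker"))  # True

partial = {"attacker": 0, "vault": 1_000_000}  # state touched, no value moved
print(exploit_succeeded(before, partial, "attacker"))  # False
```

This all-or-nothing grading is what makes the 72.2% figure meaningful: every counted success is a complete, working attack, not a promising lead.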

The Headline Numbers: GPT-5.3-Codex at 72.2% Exploit Rate

The benchmark results for GPT-5.3-Codex, running via the Codex CLI, are striking. In Exploit mode, the model achieved a 72.2% success rate against the 120 curated vulnerabilities. To put that number in context: when EVMbench was first conceived, the best available AI models could exploit fewer than 20% of the same class of critical Code4rena bugs. Today's result represents a 3.5x improvement.

The progression is even more striking when viewed across a compressed timeframe. GPT-5 — released just six months before GPT-5.3-Codex — scored 31.9% in Exploit mode. The newer model more than doubles that figure in the same test environment.

The Detect and Patch scores tell a different and important story. GPT-5.3-Codex scored 41% in Detect mode and just 28% in Patch mode. These numbers are not trivial, but they reveal a structural asymmetry: the same model that can autonomously exploit nearly three-quarters of tested vulnerabilities can only identify 41% of them and correctly remediate fewer than one in three. AI agents are currently far more capable as attackers than as defenders or repair agents.

This asymmetry is not incidental. Exploitation is a narrowly defined task with a binary success condition — the exploit either works or it does not. Detection and patching require broader reasoning about code intent, edge cases, and correct behavior across a range of inputs, which remains a harder problem for current language models.

Why the Offense-Defense Gap Is Alarming for DeFi

The numbers matter because of what they imply about cost and speed. When EVMbench was released, Hypernative — a blockchain security monitoring firm — noted that AI's exploit success rate had more than doubled compared to models available six months prior, and that the cost of producing a workable attack was dropping rapidly.

The practical consequence is a change in the time horizon of threat response. When an AI agent can iterate on an exploit in seconds, the window between vulnerability discovery and exploitation compresses from days or hours to a timeframe that makes manual human intervention structurally inadequate. A security team that relies on alerts, human triage, and escalation processes is operating on a cycle time that an AI attacker can trivially outpace.

Analysts who have examined the EVMbench results have concluded that autonomous exploitation is becoming cheap, fast, and reliable in realistic smart contract settings. For protocols securing significant TVL, this changes the risk calculus fundamentally. The threat model must now account for adversaries who can automate offensive security work at a speed and scale that was not practical a year ago.

What EVMbench Does Not Measure — and Why That Matters

EVMbench is rigorous and credible, but understanding its limits is essential to properly interpreting its results.

The benchmark covers 120 known vulnerabilities from prior audits. By construction, it cannot evaluate performance against novel attack classes — vulnerability patterns that have not yet been catalogued and studied. The 72.2% exploit rate describes AI performance on a well-defined, finite set of problems, not on the open-ended challenge of finding zero-day vulnerabilities in previously unaudited code.

The sandboxed environment imposes additional constraints. Cross-chain interactions — which underlie many of the most damaging real-world exploits — are not modeled. Precise timing dependencies, which matter in attacks against protocols with time-locked mechanisms or maximal extractable value (MEV) exposure, are not reproducible in the sandboxed setting. Long-term network state, including conditions that accumulate over multiple blocks or transactions, is also excluded.

This means EVMbench results, while significant, likely understate real-world risk in important dimensions. The benchmark measures a subset of what a sophisticated attacker would attempt. A 72% exploit rate on a curated known-vulnerability dataset does not translate directly to a 72% success probability against arbitrary live protocols — but it does demonstrate that the underlying capability is real, advancing rapidly, and available at low cost.

What the Security Community Should Do Now

The security implications of the EVMbench results are clear enough that several firms and analysts have moved to concrete recommendations rather than general caution.

Hypernative's analysis is direct: continuous monitoring and automated response must become baseline infrastructure for any DeFi protocol holding material TVL. The specific mechanisms cited include contract pausing triggered by anomaly detection, automated fund relocation to cold storage when suspicious activity is detected, and key rotation as a defensive measure. These are not novel ideas — they exist in various forms — but the EVMbench results make a stronger argument that they should be mandatory rather than optional.
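
The pause-on-anomaly mechanism cited above can be sketched as a threshold check over recent outflows. The baseline, multiplier, and outflow figures below are invented for illustration; a production system would feed this check from real monitoring and wire the decision to the protocol's own pause controls.

```python
# Hypothetical anomaly gate: pause the contract when a single block's
# outflow exceeds a multiple of the normal baseline.

BASELINE_OUTFLOW = 10_000   # assumed typical per-block outflow (illustrative)
ANOMALY_MULTIPLIER = 10     # pause at 10x baseline (illustrative)

def should_pause(recent_outflows):
    threshold = BASELINE_OUTFLOW * ANOMALY_MULTIPLIER
    return any(outflow > threshold for outflow in recent_outflows)

print(should_pause([8_000, 12_000, 9_500]))   # False: normal variance
print(should_pause([8_000, 450_000]))         # True: likely drain in progress
```

Because an AI attacker can iterate in seconds, the point of a gate like this is that the pause fires with no human in the loop; triage happens after the funds are safe.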

A broader point from analysts is that DeFi security must shift from periodic audit events to continuous processes. An audit conducted at deployment time does not account for changes in the AI threat landscape six months later. The capability curve EVMbench documents — from sub-20% to 72% exploit rate in roughly a year — is an argument for treating security posture as something that requires ongoing maintenance, not a one-time certification.

OpenAI has committed $10 million in API credits to support cyber-defense research, with priority given to open-source security tooling and critical infrastructure. This funding is intended to accelerate the defensive side of the equation — supporting the development of AI-native tools that can match the speed and scale of AI-driven attacks.

What Comes Next for AI-Driven Smart Contract Security

EVMbench is explicitly designed to evolve. As AI agent capabilities advance, the benchmark can expand to include novel vulnerability classes, additional audit sources, and more complex environmental conditions including cross-chain scenarios. The open-source model means that contributions from the security community can extend the benchmark beyond its initial 120-vulnerability scope.

The trajectory documented by EVMbench — from below 20% to 72% exploit capability in approximately a year — defines the central challenge for Web3 security in the current period. The race between offensive and defensive AI capability is not a future concern; it is active and accelerating.

For protocol teams, security firms, and researchers, EVMbench provides something previously unavailable: a consistent, reproducible baseline against which to measure both the threat and the progress of defensive tooling. The offense-defense gap is wide today. Whether it narrows or widens over the next twelve months will depend in significant part on how the security community responds to the data now in hand.
