vLLM Flaw Lets Single Request Crash AI Inference Servers via OOM
One unauthenticated HTTP request is all CVE-2026-34756 needs to OOM-crash a vLLM inference server; the missing guard is a single integer bound on the `n` parameter.

A vulnerability in vLLM's OpenAI-compatible API server lets an unauthenticated attacker crash the entire process with one HTTP request. No credentials, no exploit kit, no foothold: just a ChatCompletionRequest or CompletionRequest carrying an astronomically large value for the `n` parameter, and the server dies.
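The advisory does not publish a proof-of-concept, but a request exercising the flaw needs nothing beyond the standard OpenAI-compatible completion schema. A hedged illustration of the shape such a body would take (field values are examples, not a working exploit):

```json
{
  "model": "any-served-model",
  "prompt": "x",
  "n": 1000000000
}
```

The only anomaly is the oversized `n`; everything else is a routine completion request, which is why no scanner tuned to exotic payloads would flag it.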
The root cause, documented in GitHub Security Advisory GHSA-3mwp-wvh9-7528 and assigned CVE-2026-34756, is a missing upper-bound validator on `n` inside vLLM's Pydantic request models. The `n` parameter controls how many parallel completions the server generates per prompt. With no ceiling enforced, a single malformed request triggers millions of object allocations before it ever reaches the scheduler. That allocation storm blocks Python's asyncio event loop and drives the server process to an immediate out-of-memory crash.
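The failure mode is not vLLM-specific; it is what happens whenever per-request work scales linearly with an unbounded client-supplied count. A minimal sketch (not vLLM's actual code; the function and field names are illustrative):

```python
# Sketch of the failure mode: the server does O(n) allocation work
# up front, driven entirely by a client-controlled integer.
def build_sequences(prompt: str, n: int) -> list:
    # One object per requested completion. With no ceiling on `n`,
    # a single request can demand millions of allocations before
    # the scheduler ever sees the work.
    return [{"prompt": prompt, "index": i} for i in range(n)]

# A benign request allocates a handful of objects...
assert len(build_sequences("hi", 4)) == 4
# ...while an attacker-supplied n of 10**9 would attempt a billion,
# stalling the event loop and exhausting memory.
```

Because the allocation loop runs on the same asyncio event loop that serves other requests, the process stops responding well before the OOM kill arrives.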
The vLLM project published the advisory on April 6, rating the issue CVSS 6.5, which lands it in the Medium band. The gap between that score and the real-world blast radius is the authentication requirement: there is none. Any network route to the server's HTTP endpoint is a viable attack path, which pushes operational risk to high for publicly exposed deployments.
vLLM 0.19.0 closes the hole. The fix adds the missing upper-bound validation to both ChatCompletionRequest and CompletionRequest, rejecting oversized `n` values at the input layer before object allocation begins. Teams running any prior release should treat the upgrade as urgent. The attack surface also includes internal and ephemeral test instances, not just production clusters; any reachable HTTP endpoint qualifies.
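The advisory describes the fix's effect rather than reproducing the patch, but the pattern is standard Pydantic: declare bounds on the field so oversized values are rejected during deserialization, before any per-completion objects exist. A minimal sketch of that pattern (the cap of 128 is an arbitrary example, not the limit vLLM 0.19.0 enforces):

```python
# Illustrative sketch of the class of fix, not the actual vLLM patch.
from pydantic import BaseModel, Field, ValidationError

class CompletionRequest(BaseModel):
    prompt: str
    # ge/le bounds are enforced at parse time; an out-of-range `n`
    # raises ValidationError before any allocation begins.
    n: int = Field(default=1, ge=1, le=128)

try:
    CompletionRequest(prompt="x", n=10**9)  # attacker-sized n
except ValidationError:
    print("rejected at the input layer")
```

The design point is cheap rejection: a failed bound check costs a few comparisons, while accepting the value costs `n` allocations, so the validator must sit in front of the allocation, not behind it.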
Operators who cannot patch immediately have partial options. API gateways, WAFs, and ingress controllers that enforce parameter bounds and apply rate limiting can raise the bar against exploitation, but the vLLM project is explicit that none of these substitute for upgrading. Add vLLM to dependency scanning pipelines and emergency patching runbooks if it is not already there.
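What a gateway-side bound looks like in practice: inspect the JSON body and refuse out-of-range `n` before the request reaches vLLM. A hypothetical sketch (function name and the ceiling of 16 are illustrative choices, not vendor guidance), which raises the bar but, as the advisory stresses, does not replace upgrading:

```python
import json

MAX_N = 16  # arbitrary example ceiling for one deployment's workload

def check_request_body(raw_body: bytes) -> tuple[int, str]:
    """Return (status, reason); 200 means forward to the backend."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, "malformed JSON"
    # `n` defaults to 1 in the OpenAI-compatible schema when omitted.
    n = body.get("n", 1)
    if not isinstance(n, int) or not 1 <= n <= MAX_N:
        return 422, "n out of bounds"
    return 200, "ok"
```

A guard like this only covers the request paths it knows about, which is the usual weakness of perimeter validation and the reason the project treats it as a stopgap.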
For teams building on hybrid Rust/Python serving stacks, the advisory carries a precise architectural lesson. vLLM pairs a performance-critical core with a Python API frontend, and the vulnerability lived entirely in the Python layer: specifically in the Pydantic model that deserializes inbound requests. Security hardening that stops at the compiled-code boundary misses the full attack surface. Language boundaries do not contain failures in input validation; the weakest validation point in a hybrid system is the one that gets exploited.
No public exploit code had been reported as of the April 6 advisory. Given that exploitation requires nothing more than network access and a single crafted request body, that gap should not be mistaken for a comfortable runway.