Needle-rs runs AI tool routing locally in pure Rust and WebAssembly

Needle-rs packed a 26M-parameter tool router into 23 MB, bringing JSON function calls to Rust and WebAssembly without an API key.

Jamie Taylor·5/22/2026·2 min read

Published 07:40 PM

Needle-rs targets a different problem than the usual AI wrapper: it runs tool routing locally, in pure Rust and WebAssembly, using Cactus Compute’s Needle model. The core model is a 26-million-parameter transformer, also described as a Simple Attention Network, distilled from Gemini 3.1. Rather than acting as a chat layer on top of a remote model, it takes a query plus a tool list and turns that input into JSON function calls in a single forward pass.
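To make that input and output shape concrete, here is a minimal Rust sketch of what a router consumes (a list of tools) and produces (a JSON function call). The `Tool` and `ToolCall` types, the `is_valid_for` and `to_json` helpers, and the `{"name": ..., "arguments": ...}` layout are all assumptions made for illustration, not the Needle-rs API. The model's forward pass is not modeled, and argument values are simplified to strings.

```rust
/// Hypothetical description of one tool the router may choose from.
/// A real request would pair a list of these with the user's query.
pub struct Tool {
    pub name: &'static str,
    pub params: &'static [&'static str],
}

/// Hypothetical routing result: which tool to call, with which arguments.
/// Argument values are plain strings here to keep the sketch small.
pub struct ToolCall {
    pub name: String,
    pub args: Vec<(String, String)>,
}

/// Escape the characters that may not appear raw inside a JSON string.
fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl ToolCall {
    /// True if the call names a listed tool and uses only that tool's parameters.
    pub fn is_valid_for(&self, tools: &[Tool]) -> bool {
        tools.iter().any(|t| {
            t.name == self.name.as_str()
                && self
                    .args
                    .iter()
                    .all(|(k, _)| t.params.iter().any(|p| *p == k.as_str()))
        })
    }

    /// Render the call as a JSON object with "name" and "arguments" keys
    /// (an assumed layout, chosen for illustration).
    pub fn to_json(&self) -> String {
        let args: Vec<String> = self
            .args
            .iter()
            .map(|(k, v)| format!("\"{}\":\"{}\"", escape(k), escape(v)))
            .collect();
        format!(
            "{{\"name\":\"{}\",\"arguments\":{{{}}}}}",
            escape(&self.name),
            args.join(",")
        )
    }
}
```

With a single `get_weather` tool that accepts a `city` parameter, a call carrying `city = "Paris"` passes `is_valid_for` and serializes to one compact JSON object that downstream code can dispatch on.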

That design matters because the runtime is small enough to move with the app. Needle-rs says the package comes to about 23 MB, made up of a 258 KB runtime plus a 22 MB model. The project is built to run in browsers, edge workers, CLIs, Python, and no_std embedded targets, which puts it in a very different class from heavyweight local LLM stacks. The architecture docs split the code into needle-core, needle-infer, needle-c, needle-wasm, and needle-cli, arranged as a strict DAG, and the Rust engine is expected to match the Python/JAX reference token-for-token at every decode step.
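A token-for-token parity requirement of this kind is typically checked by comparing the two engines' output streams step by step. The sketch below shows one way to do that in Rust; `first_divergence` is a hypothetical helper, not part of Needle-rs, and it assumes both engines expose their decoded token ids as `u32` values.

```rust
/// Return the index of the first decode step at which two token streams
/// disagree, or `None` if they match exactly (same tokens, same length).
pub fn first_divergence(rust_tokens: &[u32], reference_tokens: &[u32]) -> Option<usize> {
    let common = rust_tokens.len().min(reference_tokens.len());
    for step in 0..common {
        if rust_tokens[step] != reference_tokens[step] {
            return Some(step);
        }
    }
    // One stream stopping early counts as a divergence at that step.
    if rust_tokens.len() != reference_tokens.len() {
        Some(common)
    } else {
        None
    }
}
```

Reporting the first diverging step, rather than a plain pass or fail, points directly at the decode step where the two engines part ways.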

The benchmark numbers sharpen the pitch. Needle-rs reports a median end-to-end load-and-infer time of 283 ms on an Intel i7-1185G7 laptop, while the upstream Needle project says production runs on Cactus reach about 6,000 tokens per second for prefill and 1,200 tokens per second for decode. The two figures measure different things (latency including model load for the Rust port, throughput for the upstream runtime), so they are not directly comparable. The release is also sized for deployment constraints that Rust developers will recognize immediately: the project says the footprint fits in a service-worker cache or a CDN edge budget.

Under the hood, Needle-rs uses INT4 quantization, AVX2 and NEON acceleration, constrained decoding to keep output valid JSON, and greedy decoding only, which pushes it toward deterministic tool selection rather than open-ended generation. It also supports both flat tool schemas and OpenAI-style schemas, making it easier to slot into existing agent pipelines without rewriting every interface.
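The combination of constrained and greedy decoding can be illustrated with a single decode step: mask out every token that would break the output format, then take the highest-scoring token that remains. This is a generic sketch of the technique, not Needle-rs code. `constrained_greedy_step` is a hypothetical name, and in a real system the `allowed` mask would be derived from the JSON grammar state and the tool schema rather than passed in by hand.

```rust
/// One greedy decoding step restricted to the tokens a grammar allows.
///
/// `logits[i]` is the model's score for token `i`; `allowed[i]` says whether
/// token `i` keeps the output valid at this position. Returns the id of the
/// highest-scoring allowed token (the first one on ties), or `None` if the
/// mask rules out every token. Assumes the logits are finite.
pub fn constrained_greedy_step(logits: &[f32], allowed: &[bool]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (id, (&logit, &ok)) in logits.iter().zip(allowed).enumerate() {
        // A candidate wins only if it is strictly better than the current best.
        let improves = best.map_or(true, |(_, top)| logit > top);
        if ok && improves {
            best = Some((id, logit));
        }
    }
    best.map(|(id, _)| id)
}
```

Because the step involves no sampling, the same logits and mask always yield the same token, which is where the deterministic behavior comes from. The mask guarantees that a high-scoring but invalid token can never be emitted.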

Cactus Compute says the model weights and dataset-generation pipeline are open under MIT, so the result is not just a small, portable runtime but one that can be inspected and reused. For Rust developers experimenting with local-first AI, Needle-rs is the kind of release that points to a real deployment primitive: a narrow, fast tool router that can live in the browser, on the edge, or inside an embedded binary without dragging in a heavyweight stack.
