
Shimmy Rust app delivers fast, tiny local AI serving with OpenAI compatibility

Shimmy packs OpenAI-compatible local AI serving into a 4.8MB Rust binary, with 100ms startup and roughly 50MB RAM usage.

Sam Ortega · 2 min read

A 4.8MB Rust binary that can serve local AI models with OpenAI-compatible endpoints is a real gut-check for how much overhead the modern LLM stack has carried around. Shimmy is pitching exactly that: fast local inference, about 100ms startup, roughly 50MB RAM usage, and a setup simple enough to feel almost suspicious the first time you run it.

The project, by Michael-A-Kuykendall, is framed as a single-binary Rust inference server with support for GGUF and SafeTensors models, hot model swapping, and automatic model discovery. The crate listing says it can be installed with cargo install shimmy --features huggingface, then used with shimmy serve and shimmy list. That puts it squarely in the “bring your own model, keep the rest of the stack light” camp.
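Per the crate listing, getting started is meant to be a couple of commands. A rough sketch of that flow, with the caveat that the exact feature flag and the server's default port should be verified against the current Shimmy docs for your version:

```shell
# Install from crates.io with the Hugging Face feature enabled
cargo install shimmy --features huggingface

# Start the OpenAI-compatible local server
shimmy serve

# In another terminal: show the models Shimmy has auto-discovered
shimmy list
```

The absence of any model-registry or config-file step here is the point: discovery is supposed to happen automatically from caches and local directories.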

Shimmy’s bigger selling point is compatibility. Its documentation says it works with OpenAI SDKs and tools across Python, Node.js, curl, VS Code extensions, Cursor, and Continue.dev. That means the appeal is not just that the server is small, but that it can slot into workflows already built around OpenAI-style endpoints without forcing a rewrite of client code or a detour through a heavier runtime.
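Because the endpoints follow the OpenAI wire format, existing client code mostly just needs its base URL repointed at the local server. A minimal sketch of the request shape such clients emit, using only the Python standard library; the model name and port here are placeholders for illustration, not values from Shimmy's documentation:

```python
import json

# An OpenAI-style chat completion request body. Any client that can
# produce this shape (the OpenAI SDKs, curl, editor extensions like
# Cursor or Continue.dev) can target a local server speaking the
# same protocol.
payload = {
    "model": "local-model",  # placeholder: use whatever `shimmy list` reports
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize GGUF in one sentence."},
    ],
    "temperature": 0.7,
}

# The only client-side change versus the hosted API is the base URL,
# e.g. http://localhost:PORT/v1/chat/completions (port is version-specific).
body = json.dumps(payload)
print(body)
```

The design choice this illustrates is compatibility at the protocol layer: no client rewrite, just a different base URL.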

The automation angle is just as important. Shimmy auto-discovers models from the Hugging Face cache, Ollama, and local directories, and it can also detect LoRA adapters. In practice, that turns it into a local model launcher that is trying to remove the fiddly parts: no hand-built model registry, no elaborate configuration file, and no Python dependency chain holding the whole thing together.


The project author has also pushed the performance story hard. In a Hacker News post, Michael-A-Kuykendall said v1.2.0 added native SafeTensors support, letting Shimmy load .safetensors files directly in Rust. That same post claims 2x faster model loading, zero Python dependencies, a binary still around 5MB, GPU and CUDA support when rebuilt with CUDA enabled, and the option to use Shimmy as a Rust library rather than only as a standalone executable.

The contrast with Ollama is obvious. Ollama’s documentation says its API is available by default at localhost:11434/api, and its quickstart spans models from 1B and 2B up to 405B, with memory guidance of at least 8GB RAM for 7B models, 16GB for 13B, and 32GB for 33B. Google’s Gemma guide also points to Ollama and llama.cpp as practical ways to run quantized GGUF models on laptops or other small devices without a GPU. Shimmy’s pitch is narrower but sharper: keep the local-LLM promise, strip out the bloat, and make ordinary machines feel plenty big enough.
