Hugging Face Candle Adds Native Rust Support for Google's Gemma 4
Hugging Face's Candle framework added native Rust inference for Google's Gemma 4, just two days after the Apache 2.0 model family dropped on April 2, 2026.

Hugging Face's Candle framework landed a native Rust implementation of Google's Gemma 4 in commit #3443, merging into the `huggingface/candle` repository around April 4, roughly 48 hours after Google DeepMind published Gemma 4 under an Apache 2.0 license. The turnaround signals something more than routine maintenance: the Candle maintainers are actively tracking the open-model release cadence and wiring up first-class Rust inference paths before the community has had time to file its first bug reports.
Gemma 4 is a multimodal family built on the same research foundation as Gemini 3. It ships in four configurations covering a wide span of hardware: the edge-optimized E2B and E4B variants, which activate roughly 2 and 4 billion effective parameters respectively during inference, plus a 26-billion-parameter mixture-of-experts variant and a 31-billion-parameter dense model. All four accept text, image, and audio inputs. That breadth makes native runtime support non-trivial, and the Candle commit builds it directly into `candle-examples` as a `gemma4` example rather than relying on Python bindings or a thin wrapper.
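The four configurations span an order of magnitude in active parameters. A minimal sketch of how a Rust caller might model that spread (the enum, method, and figures below are illustrative, drawn only from the sizes listed above; they are not Candle's actual configuration API):

```rust
// Illustrative only: models the four Gemma 4 variants described above.
// Names and numbers follow the article, not Candle's real config types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Gemma4Variant {
    E2B,      // edge-optimized, ~2B effective parameters
    E4B,      // edge-optimized, ~4B effective parameters
    MoE26B,   // 26B mixture-of-experts
    Dense31B, // 31B dense
}

impl Gemma4Variant {
    /// Approximate parameters active during inference, in billions.
    fn active_params_b(self) -> f32 {
        match self {
            Gemma4Variant::E2B => 2.0,
            Gemma4Variant::E4B => 4.0,
            // The MoE variant routes tokens through a subset of experts;
            // only the headline size is recorded here.
            Gemma4Variant::MoE26B => 26.0,
            Gemma4Variant::Dense31B => 31.0,
        }
    }
}

fn main() {
    let all = [
        Gemma4Variant::E2B,
        Gemma4Variant::E4B,
        Gemma4Variant::MoE26B,
        Gemma4Variant::Dense31B,
    ];
    for v in all {
        println!("{:?}: ~{}B active", v, v.active_params_b());
    }
}
```

A runtime that supports all four from one code path has to handle both the dense and mixture-of-experts execution styles, which is part of why native support is non-trivial.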
The addition arrived alongside broader activity across the codebase. The `candle-flash-attn` and `candle-kernels` subcrates received version bumps in the same early-April window, and a separate work-in-progress pull request, PR #3424, is pushing an initial ROCm backend forward. Candle already carries CUDA and Metal acceleration paths inside its multi-crate workspace (`candle-core`, `candle-nn`, `candle-transformers`, `candle-examples`), and ROCm support would extend that hardware coverage to AMD GPUs running the open compute stack.

For teams running inference without a Python runtime, particularly in edge deployments, serverless functions, or minimal container images, the Gemma 4 addition materially expands what Candle can serve. The `cargo run --example gemma4 --features metal` invocation pattern that users have already been testing with the `google/gemma-4-E2B-it` model ID illustrates the kind of zero-overhead, dependency-light deployment story that makes Rust inference backends attractive in the first place. Where a Python stack pulls in transformers, accelerate, and a half-dozen CUDA libraries, a compiled Candle binary carries only what it links.
Gemma 4 also introduces configurable visual token budgets (options at 70, 140, 280, 560, and 1,120 tokens) to let callers trade image resolution for compute, an architecture detail that Rust implementors need to surface correctly or risk silent quality degradation. That Candle's maintainers shipped the implementation within days of the model's public release suggests the project has the upstream access and contributor bandwidth to keep pace as Google DeepMind and others push new open-weight releases in 2026.
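Because those budgets form a discrete ladder, a runtime accepting an arbitrary requested budget has to snap it to a supported value. A hypothetical helper (the function name and rounding policy below are illustrative, not from the Candle commit) might round down to the largest supported budget so image preprocessing never silently exceeds the compute the caller asked for:

```rust
/// Supported visual token budgets for Gemma 4, per the release notes.
const VISUAL_TOKEN_BUDGETS: [u32; 5] = [70, 140, 280, 560, 1120];

/// Hypothetical helper: snap a requested budget to the largest supported
/// value that does not exceed it, falling back to the minimum (70) when
/// the request is below every supported option.
fn snap_visual_budget(requested: u32) -> u32 {
    VISUAL_TOKEN_BUDGETS
        .iter()
        .copied()
        .filter(|&b| b <= requested)
        .max()
        .unwrap_or(VISUAL_TOKEN_BUDGETS[0])
}

fn main() {
    // A caller asking for 300 visual tokens gets the 280-token budget.
    println!("{}", snap_visual_budget(300));
}
```

Rounding down is the conservative choice for a compute budget; an implementation that rounded up instead would be the "silent quality degradation" risk in reverse, quietly spending more compute than requested.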