
Pure-Rust LLaMA Decoder llama2.rs Gains Traction with 1.1k Stars and Claimed CPU Performance

srush/llama2.rs shows roughly 1.1k stars and 167 commits as of late February 2026, with README-claimed CPU speeds of 1 tok/s for 70B and 9 tok/s for 7B Llama2 on an Intel i9.

Jamie Taylor · 2 min read

llama2.rs (srush/llama2.rs) has drawn measurable attention from the Rust and LLM communities, showing roughly 1.1k stars and a commit history of 167 commits in a GitHub snapshot observed in late February 2026. The repository's README advertises concrete CPU performance numbers: "Can run up on 1 tok/s 70B Llama2 and 9 tok/s 7B Llama2. (on my intel i9 desktop)". Those figures make this pure-Rust decoder notable for teams targeting Rust-native inference pipelines.

The project positions itself plainly: "This is a Rust implementation of Llama2 inference on CPU" and states that "The goal is to be as fast as possible." The repo credits authors directly: "Llama2.rs is written by @srush and @rachtsingh." The README traces the code’s origin: "Originally, a Rust port of Karpathy's llama2.c but now has a bunch more features to make it scale to 70B," which explains the current emphasis on adding scalability for large models.

Build and runtime details in the repository are explicit about toolchain and compilation requirements. The README instructs: "To build, you'll need the nightly toolchain, which is used by default:" and also states that "The library needs to be recompiled to match the model. You can do this with cargo." The project includes a rust-toolchain.toml file and other build artifacts such as Cargo.toml and build.rs, plus configuration in `.cargo/config`, indicating that the toolchain and build settings are pinned explicitly.
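The snapshot does not reproduce the toolchain file's contents, but a rust-toolchain.toml that makes nightly the default channel typically looks like the following hypothetical sketch:

```toml
# Hypothetical contents; the actual file in srush/llama2.rs may pin a
# specific nightly date or list extra components.
[toolchain]
channel = "nightly"
```

With such a file in the repository root, rustup selects the nightly toolchain automatically, which matches the README's note that nightly "is used by default."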

Model export and loading are covered with an example workflow that uses a Python export script and a requirements file. The README example shows two concrete commands:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True

That example produces a file named l70b.act64.bin from the model identifier TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ using the gptq-4bit-64g-actorder_True option, and the repository lists requirements.export.txt and export.py among its files.

Dependencies called out in the README are concrete: memmap2, rayon, clap, PyO3, and portable_simd. The repo layout visible in the snapshot includes folders such as src, tests, python/llama2_rs, .github/workflows, and .vscode, and files including README.md, LICENSE, tokenizer.bin, pyproject.toml, and rust-toolchain.toml.
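Based on that dependency list, the crate's Cargo.toml plausibly contains entries along these lines. The versions below are illustrative guesses, not taken from the repository, and portable_simd is a nightly standard-library feature rather than a crate, which is consistent with the README's nightly-toolchain requirement:

```toml
# Illustrative sketch only; real versions and feature flags live in the
# repo's actual Cargo.toml.
[dependencies]
memmap2 = "0.9"   # memory-mapped access to model weight files
rayon = "1.8"     # data-parallel loops over matrix rows
clap = { version = "4", features = ["derive"] }  # CLI argument parsing
pyo3 = "0.20"     # Python bindings (python/llama2_rs)

# portable_simd is enabled in source via `#![feature(portable_simd)]`,
# hence the nightly requirement; it is not a Cargo dependency.
```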

These explicit details imply practical steps for integration: use the nightly Rust toolchain, recompile the library via cargo to match a chosen model, and if following the README example, run pip install -r requirements.export.txt then python export.py with the noted model identifier. With roughly 1.1k stars and recent commit activity as of late Feb 2026, llama2.rs is a clear signal of community interest in Rust-native LLaMA decoders and will likely affect developers choosing between C ports and Rust-first inference stacks as the project continues to mature.
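Put together, a hedged end-to-end sketch of those steps looks like this. The model identifier and output filename come from the README example quoted above; the rustup and cargo invocations are generic guesses, since the README's exact build command is not quoted in the snapshot:

```shell
# Export a GPTQ-quantized 70B model to the binary format the decoder reads
# (commands from the README example).
pip install -r requirements.export.txt
python export.py l70b.act64.bin \
    TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True

# Recompile the Rust library to match the exported model.
# The README requires the nightly toolchain; a generic invocation might be:
rustup toolchain install nightly
cargo +nightly build --release
```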
