Rust speaker diarization tool speakrs beats pyannote on Apple Silicon

speakrs hit a 7.1% diarization error rate on VoxConverse dev at 529x realtime on CoreML, edging out pyannote while running far faster on Apple Silicon.

Jamie Taylor

A Rust diarization engine just made the speed-versus-accuracy tradeoff look a lot less settled. On VoxConverse dev, speakrs posted a 7.1% diarization error rate (DER) at 529x realtime on CoreML, while pyannote registered 7.2% at 24x. The result landed in a May 27 RustCC roundup, and it matters to anyone building podcast tools, meeting transcripts, or personal media search in Rust.
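To put the realtime factors in concrete terms, here is a minimal sketch of the arithmetic, assuming "Nx realtime" means audio duration divided by processing time. The function name `processing_secs` is hypothetical and not part of speakrs or pyannote; only the 529x and 24x figures come from the benchmark above.

```rust
/// Seconds of compute needed to process `audio_secs` of audio at a given
/// realtime factor (assumed here to be audio duration / processing time).
fn processing_secs(audio_secs: f64, realtime_factor: f64) -> f64 {
    audio_secs / realtime_factor
}

fn main() {
    let one_hour = 3600.0; // one hour of audio, in seconds

    // Realtime factors quoted for VoxConverse dev.
    let speakrs = processing_secs(one_hour, 529.0);
    let pyannote = processing_secs(one_hour, 24.0);

    println!("speakrs:  {:.1} s per hour of audio", speakrs);
    println!("pyannote: {:.1} s per hour of audio", pyannote);
    println!("ratio:    {:.1}x", pyannote / speakrs);
}
```

Under that assumption, an hour of audio takes roughly 6.8 seconds at 529x and 150 seconds at 24x, a gap of about 22x.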

The project’s pitch is straightforward: keep speaker separation close to pyannote-class quality, then remove the Python overhead and let native code do the heavy lifting. The speakrs repository describes the project as fast Rust speaker diarization with pyannote-level accuracy, and its pipeline stays in Rust from segmentation and powerset decoding through overlap-add aggregation, binarization, embeddings, PLDA scoring, and VBx clustering. The repo also says there is no Python runtime in the library path, which makes the stack easier to embed when startup time, dependency weight, and deployment size matter. The same repository now lists performance at 312-912x realtime on Apple Silicon and 50-121x on CUDA. avencera/smrze is already cited as a small end-to-end app built on the library.

That benchmark comparison matters because VoxConverse is not a toy dataset. It was built from in-the-wild YouTube videos, with political debates and news segments, overlapping speech, and background conditions that make diarization hard. The dataset was introduced in the Interspeech 2020 paper Spot the conversation: speaker diarisation in the wild, and it is still a useful stress test for tools that have to sort out who spoke when. VBx, one of speakrs’ core building blocks, is a Bayesian HMM clustering method for x-vectors and has long been treated as a standard baseline. pyannote.audio, by contrast, remains a Python-first toolkit built on PyTorch, which has helped make it a default choice for diarization work in the broader machine-learning world.
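For readers new to the metric behind these scores, the diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The sketch below shows that formula; the function name `der` is hypothetical, the durations are invented for illustration and are not VoxConverse measurements, and real scoring tools also handle details such as forgiveness collars and overlap that are omitted here.

```rust
/// Diarization error rate as a fraction:
/// (missed + false alarm + confusion) / total reference speech.
/// Returns None when there is no reference speech to score against.
fn der(missed: f64, false_alarm: f64, confusion: f64, total_speech: f64) -> Option<f64> {
    if total_speech <= 0.0 {
        return None;
    }
    Some((missed + false_alarm + confusion) / total_speech)
}

fn main() {
    // Illustrative durations in seconds (invented, not benchmark data).
    let missed = 30.0; // reference speech the system did not detect
    let false_alarm = 20.0; // non-speech the system labeled as speech
    let confusion = 21.0; // speech attributed to the wrong speaker
    let total_speech = 1000.0; // total reference speech

    match der(missed, false_alarm, confusion, total_speech) {
        Some(rate) => println!("DER = {:.1}%", rate * 100.0),
        None => println!("DER undefined: no reference speech"),
    }
}
```

With these made-up numbers, 71 seconds of error over 1,000 seconds of speech gives a DER of 7.1%, which is the scale of error the benchmark figures describe.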

For Rust makers, the practical promise is not just that speakrs is faster. It is that a native stack can get close enough on quality to make throughput the deciding factor for more jobs. That is a real opening for offline transcription, batch podcast indexing, and local meeting analytics, where diarization cost can become the bottleneck long before storage or search does. Where the recordings are messy, the speakers overlap heavily, or the archive has to be as faithful as possible, pyannote-class accuracy still matters. But for many hobby audio tools, speakrs has moved the bar from “can Rust do this?” to “how much faster do you need it?”

This article was produced by Prism’s automated news system from verified source data, official records, and press releases, then run through automated quality and moderation checks before publishing. The system is built and supervised by the people who set the standards it runs under.
