Mistral AI Launches Voxtral Transcribe 2 for Multilingual Production Workloads
Dual Model Strategy for Batch and Realtime
Mistral’s new Voxtral Transcribe 2 family targets production workloads with two models, one for batch processing and one for realtime streaming, with the release prioritizing cost, latency, and deployment constraints. The family consists of Voxtral Mini Transcribe V2, which handles batch transcription with diarization, and Voxtral Realtime (Voxtral Mini 4B Realtime 2602), a low-latency streaming transcription model released as open weights.
Configurable Latency and Streaming Architecture
Voxtral Realtime utilizes a 4B parameter architecture, comprising an approximately 3.4B language model and a 0.6B audio encoder trained from scratch with causal attention. Both the encoder and the language model employ sliding-window attention to facilitate effectively infinite streaming. A key feature is the configurable latency versus accuracy trade-off, tunable via a transcription_delay_ms parameter ranging from 80 ms to 2.4 s. For interactive agents, delays can be set to sub-200 ms. At a 480 ms delay, the model performs on par with leading offline open-source models on benchmarks like FLEURS. Increasing the delay to 2.4 s allows the Realtime model to match the accuracy of the offline Voxtral Mini Transcribe V2.
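The trade-off above can be sketched as a small helper. The `transcription_delay_ms` parameter name and its 80 ms to 2.4 s range come from the article; the use-case presets below are illustrative assumptions, not part of Mistral's API.

```python
# Documented bounds for the delay parameter (80 ms to 2.4 s).
MIN_DELAY_MS = 80
MAX_DELAY_MS = 2400

# Illustrative presets (assumptions): sub-200 ms for interactive agents,
# 480 ms for parity with offline open-source models on benchmarks like
# FLEURS, and 2.4 s to match the offline Voxtral Mini Transcribe V2.
PRESETS = {
    "interactive_agent": 160,
    "balanced": 480,
    "max_accuracy": 2400,
}

def transcription_delay_ms(use_case: str) -> int:
    """Return a clamped transcription_delay_ms value for a given use case."""
    delay = PRESETS.get(use_case, 480)
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, delay))
```

In practice the choice is a dial rather than a preset: every additional millisecond of delay gives the streaming model more right-context, which is why accuracy converges toward the offline model at the 2.4 s ceiling.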
Enterprise Features and Diarization
Voxtral Mini Transcribe V2 focuses on enterprise-oriented capabilities, specifically speaker diarization. This feature outputs speaker labels with precise start and end times, designed for multi-party scenarios such as meetings and interviews. When speech overlaps, the model typically emits a single speaker label. Additional features include word-level timestamps for subtitles and searchable audio workflows, as well as noise robustness for environments like factory floors and call centers. This model supports long-form audio, handling up to 3 hours in a single request.
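To illustrate how diarized, word-level output might be consumed downstream, here is a hedged sketch. The `Word` structure and `to_turns` helper are hypothetical, not Mistral's response schema; they show the common pattern of merging consecutive same-speaker words into labeled speaker turns for meeting transcripts.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """Hypothetical word-level record: text plus timestamps and a speaker label."""
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: str  # e.g. "speaker_0"

def to_turns(words):
    """Merge consecutive words by the same speaker into
    (speaker, start, end, text) turns."""
    turns = []
    for w in words:
        if turns and turns[-1][0] == w.speaker:
            spk, start, _, text = turns[-1]
            turns[-1] = (spk, start, w.end, text + " " + w.text)
        else:
            turns.append((w.speaker, w.start, w.end, w.text))
    return turns
```

Note that because the model typically emits a single speaker label during overlapping speech, a consumer like this sees clean, non-overlapping turns rather than interleaved fragments.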
Deployment, Pricing, and Language Support
Deployment options vary by model. Voxtral Mini Transcribe V2 is accessible via a closed batch API priced at $0.003 per minute. Voxtral Realtime is available both as Apache 2.0 open weights and via an API at $0.006 per minute. The Realtime model is released in BF16 and supports on-device or edge deployment, capable of running on a single GPU with 16 GB or more memory. Both models cover 13 languages, with Mistral reporting that non-English performance significantly outpaces competitors. In terms of accuracy, the models are stated to outperform GPT-4o mini Transcribe, Gemini 2.5 Flash, and Deepgram Nova, while processing audio approximately three times faster than ElevenLabs’ Scribe v2.
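The published per-minute rates make cost estimation a simple multiplication. A minimal sketch, using only the figures stated above (the helper function itself is illustrative):

```python
BATCH_RATE_USD_PER_MIN = 0.003     # Voxtral Mini Transcribe V2 batch API
REALTIME_RATE_USD_PER_MIN = 0.006  # Voxtral Realtime API

def transcription_cost(minutes: float, realtime: bool = False) -> float:
    """Estimated USD cost for transcribing `minutes` of audio."""
    rate = REALTIME_RATE_USD_PER_MIN if realtime else BATCH_RATE_USD_PER_MIN
    return round(minutes * rate, 4)

# A full 3-hour batch request (the documented per-request maximum)
# costs 180 min * $0.003/min = $0.54.
```

At these rates, a full 3-hour batch request costs $0.54, and the same audio over the realtime API costs exactly twice that, $1.08.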
