
Microsoft Launches Three Foundational AI Models, Expanding Beyond OpenAI Partnership

Microsoft's three new AI models already power Copilot and PowerPoint, and MAI-Transcribe-1 beats OpenAI's Whisper on accuracy across all 25 tested languages at a starting price of $0.36 per hour.

Marcus Williams•4/3/2026•3 min read

Published 12:02 PM




Microsoft rolled out three in-house foundation models on April 2, a pointed move to control more of its AI infrastructure rather than routing exclusively through OpenAI or other external providers. The three models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, are available via Azure Foundry and the MAI Playground, and Microsoft confirmed they are already running inside Copilot, Bing, PowerPoint, and Azure Speech.

The sharpest performance claims belong to MAI-Transcribe-1, a speech-to-text model that Microsoft says achieves the lowest average word error rate on the FLEURS benchmark, the industry-standard multilingual evaluation, across 25 languages. According to Microsoft's own benchmarks, MAI-Transcribe-1 outperforms OpenAI's Whisper-large-v3 on all 25 tested languages, edges out Google's Gemini Flash on 22 of 25, and beats both ElevenLabs' Scribe v2 and OpenAI's GPT-Transcribe on 15 of 25 each. The model posts an average word error rate of 3.8%, and Microsoft says batch transcription runs 2.5 times faster than its previous Azure Fast offering. It accepts MP3, WAV, and FLAC files up to 200MB and was designed for call-center and conference-room recordings, overlapping speech, and degraded audio. Pricing starts at $0.36 per hour, which Microsoft described as the best performance-per-dollar of any major cloud provider.
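For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that calculation. It is not Microsoft's or FLEURS's scoring code: the function name `word_error_rate` is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions, since benchmark scoring typically applies language-specific text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: lowercase + whitespace split stands in for real text
    normalization, which varies by benchmark and language.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

By this definition, one wrong word in a five-word reference gives a WER of 20%, so a 3.8% average corresponds to roughly four errors per hundred reference words.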

MAI-Voice-1 generates 60 seconds of expressive audio in under one second on a single GPU. Developers pay $22 per one million characters and can use the Personal Voice feature in Azure Speech to clone a voice from a 10-second audio sample, though that capability requires approval under Microsoft's responsible AI policies.

MAI-Image-2, which covers both image and video generation, debuted third on the Arena.ai image model leaderboard. Pricing runs $5 per one million tokens for text input and $33 per one million tokens for image output. Microsoft has already begun threading MAI-Image-2 into Bing and PowerPoint for production users.
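The list prices quoted above can be turned into a rough budget estimate. The Python sketch below assumes simple linear billing with no minimums, volume tiers, or discounts, and reads the transcription price as per hour of audio processed; neither assumption is confirmed in Microsoft's announcement, and because the $0.36 figure is a starting price, real bills may differ. All function and constant names are hypothetical.

```python
# List prices as reported in this article (USD).
TRANSCRIBE_USD_PER_HOUR = 0.36           # MAI-Transcribe-1, starting price
VOICE_USD_PER_MILLION_CHARS = 22.0       # MAI-Voice-1
IMAGE_TEXT_IN_USD_PER_MILLION = 5.0      # MAI-Image-2, text input tokens
IMAGE_OUT_USD_PER_MILLION = 33.0         # MAI-Image-2, image output tokens


def transcription_cost(audio_hours: float) -> float:
    """Assumes billing is linear in hours of audio processed."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Assumes billing is linear in characters synthesized."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Assumes input and output tokens are billed independently."""
    return (
        text_input_tokens / 1_000_000 * IMAGE_TEXT_IN_USD_PER_MILLION
        + image_output_tokens / 1_000_000 * IMAGE_OUT_USD_PER_MILLION
    )


if __name__ == "__main__":
    print(f"1,000 hours of transcription: ${transcription_cost(1000):.2f}")
    print(f"2M characters of speech:      ${voice_cost(2_000_000):.2f}")
    print(f"1M tokens in, 1M tokens out:  ${image_cost(1_000_000, 1_000_000):.2f}")
```

Under these assumptions, 1,000 hours of audio would cost about $360 to transcribe, two million characters of synthesized speech about $44, and one million tokens each of text input and image output about $38.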

The integration picture matters as much as the benchmarks for enterprise buyers weighing data residency and governance. Microsoft confirmed MAI-Transcribe-1 is being tested inside Copilot's Voice mode and Microsoft Teams for conversation transcription. MAI-Image-2 is in active rollout across Bing and PowerPoint. What remains developer-only for now is the full Azure Foundry API surface; the models are not yet broadly configurable within enterprise tenants the way Azure OpenAI Service models are, and features including diarization, contextual biasing, and streaming are listed as forthcoming for MAI-Transcribe-1.

Mustafa Suleyman, who leads the company's dedicated AI research unit, framed the releases as part of what he calls "Humanist AI," describing it as "putting humans at the center, optimizing for how people actually communicate, training for practical use." He was equally explicit about the pricing logic: "We're pricing them to be the very best of any hyperscaler. So there will be the cheapest of any of the hyperscalers out there, Amazon. And obviously Google. And that's a very conscious decision."

That pricing philosophy has structural consequences. By controlling its own transcription, voice, and image infrastructure, Microsoft can offer tighter data residency guarantees to regulated industries and reduce the per-query costs that previously flowed to third-party APIs. For enterprise customers who have hesitated to route sensitive audio or proprietary images through external providers, in-house models with Azure-native governance controls present a materially different procurement calculus from licensing the same capabilities from OpenAI or Google.

Sources:

microsoft.ai, techcrunch.com
