Microsoft Built Its Own AI Models. Here Is What They Actually Do.
Microsoft just shipped three production-grade AI models under its MAI label, delivering on a strategy that has been quietly taking shape since Mustafa Suleyman joined as CEO of Microsoft AI. The models cover speech recognition, voice synthesis, and image generation. They are available today on Azure Foundry and the MAI Playground. And they are notably cheaper than comparable offerings from OpenAI and Google.
Here is what each model does, how it is priced, and why the timing matters.
MAI-Transcribe-1: Faster, Cheaper Speech Recognition
MAI-Transcribe-1 transcribes spoken audio across 25 languages. Microsoft claims it is 2.5 times faster than Azure Fast, the company's current quickest transcription offering. The model is designed for noisy real-world conditions: call centers, conference rooms, and similar environments where background audio complicates clean recognition.
Microsoft says it is testing MAI-Transcribe-1 integrations with Copilot and Teams. Pricing starts at $0.36 per hour of transcribed audio, roughly comparable to AWS Transcribe's standard rate and cheaper than premium Azure speech tiers.
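At $0.36 per transcribed hour, the budgeting math is straightforward. A minimal sketch, assuming an illustrative call-center volume (the daily hours below are a made-up example, not a Microsoft figure):

```python
# Back-of-envelope spend estimate for MAI-Transcribe-1 at the quoted
# $0.36 per hour of transcribed audio.
RATE_PER_HOUR = 0.36  # USD, from Microsoft's published pricing

def monthly_transcription_cost(hours_per_day: float, days: int = 30) -> float:
    """Estimated monthly spend for a given daily audio volume."""
    return hours_per_day * days * RATE_PER_HOUR

# e.g. a call center transcribing 500 hours of audio per day
print(f"${monthly_transcription_cost(500):,.2f}")  # → $5,400.00
```

At that volume, the per-hour rate matters more than any per-seat licensing, which is why the comparison against AWS Transcribe's standard rate is the relevant one for high-throughput buyers.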
The 25-language coverage is meaningful for enterprise customers operating across markets. This is not a research preview. It is being positioned as production infrastructure.
MAI-Voice-1: Custom Voice Generation
MAI-Voice-1 generates audio from text at speed: up to 60 seconds of audio output in one second of processing time. It also supports custom voice profile creation, which is the capability most enterprise buyers have been waiting for in a Microsoft-native product.
Pricing is character-based at $22 per million characters. That is competitive with Google Text-to-Speech at high volumes, and positions MAI-Voice-1 for workloads like interactive voice response systems, AI-powered customer service agents, and narration tooling.
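Character-based pricing makes voice workloads easy to model. A rough sketch, assuming an illustrative IVR deployment (the prompt length and call volume are assumptions for illustration, not published figures):

```python
# Cost model for MAI-Voice-1's character-based pricing
# ($22 per million characters of input text).
RATE_PER_MILLION_CHARS = 22.0  # USD

def tts_cost(characters: int) -> float:
    """Estimated synthesis cost for a given number of input characters."""
    return characters / 1_000_000 * RATE_PER_MILLION_CHARS

# e.g. an IVR system speaking a 400-character prompt on 100,000 calls
monthly_chars = 400 * 100_000  # 40 million characters
print(f"${tts_cost(monthly_chars):,.2f}")  # → $880.00
```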
The model is available through Microsoft Azure Foundry. Microsoft has paired the release with guidance under its Azure Responsible AI framework covering voice cloning constraints and acceptable use.
MAI-Image-2: Top-Three Ranking at a Lower Price Point
MAI-Image-2 was quietly released on March 19 through the MAI Playground, then rolled out this week to Bing image generation and PowerPoint's Designer feature, giving it broad consumer distribution almost immediately.
Microsoft says the model currently ranks in the top three on the Arena.ai image generation leaderboard, which aggregates human preference votes across models rather than relying on internal benchmarks. That places it alongside DALL-E 3 and Midjourney's recent releases based on independent evaluation.
Pricing: $5 per million tokens for text input and $33 per million tokens for image output. DALL-E 3 via the OpenAI API costs considerably more per image at comparable quality tiers. At comparable Arena rankings, MAI-Image-2 undercuts both OpenAI and Google Imagen on cost.
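The split input/output token rates mean per-image cost depends on how a request is billed. A minimal sketch, assuming hypothetical token counts per request (Microsoft's per-image token accounting is not detailed in the launch materials):

```python
# Per-request cost sketch for MAI-Image-2's token-based pricing:
# $5 per million input (text) tokens, $33 per million output (image) tokens.
INPUT_RATE = 5.0 / 1_000_000    # USD per text token
OUTPUT_RATE = 33.0 / 1_000_000  # USD per image token

def image_request_cost(prompt_tokens: int, image_tokens: int) -> float:
    """Estimated cost of one generation request."""
    return prompt_tokens * INPUT_RATE + image_tokens * OUTPUT_RATE

# e.g. a 100-token prompt producing an image billed at 1,000 output tokens
print(f"${image_request_cost(100, 1_000):.4f}")  # → $0.0335
```

Under those assumed counts, output tokens dominate the bill, so the $33 output rate is the number to compare against per-image pricing from OpenAI and Google.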
Why Microsoft Is Building Its Own Models
Microsoft still holds a reported 49% stake in OpenAI and continues integrating GPT models across its products. That relationship is not ending. But having in-house alternatives changes the negotiating position.
Suleyman acknowledged the current limits in a recent interview: Microsoft is "not able to build models in the very largest scale yet," but the compute infrastructure is coming. The company is developing the Maia 200 AI accelerator chip and the Fairwater data center network to support its own training workloads, with capacity expected to ramp through 2026.
The three MAI models are not competing with GPT-4o on general reasoning tasks. They are production-grade specialized models in three areas where Microsoft has clear deployment paths: speech through Teams and Copilot, voice through enterprise customer service, images through Bing and Office. Each targets a specific revenue stream.
The pattern here is deliberate. Rather than building a frontier general-purpose model from scratch and competing head-on with OpenAI, Google, and Anthropic, Microsoft is assembling modular components that extend Copilot and give Azure customers more options at lower cost. For developers already running workloads in the Azure ecosystem, Microsoft-native transcription, voice, and image generation no longer route through OpenAI. The cost and latency advantages are worth evaluating.
According to TechCrunch's coverage of the launch, the MAI team was formed just six months ago. GeekWire noted the models represent Microsoft's clearest step yet toward expanding beyond its OpenAI dependency.
For context on how Microsoft's current AI products work in practice, the guide to Microsoft Copilot Tasks covers what the existing Copilot can do. For current pricing across major AI platforms, the AI Chatbot Pricing Index for April 2026 covers both subscription and API rates.
Getting Access
All three models are available on Microsoft Azure Foundry. MAI-Image-2 is accessible in Bing and appearing in PowerPoint. The MAI Playground provides API testing access without a full Azure provisioning step.
Microsoft's bet is not that MAI will replace OpenAI in its products any time soon. It is that having credible alternatives reduces dependency, lowers costs, and keeps the partnership on its terms. Shipping three production models in six months suggests the execution capacity to back that bet.