
OpenAI's Realtime Voice Models: What Changes for AI Video

On May 8, 2026, OpenAI moved its Realtime API out of beta and shipped three new audio models alongside it: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. These are voice models, not video models — but they reshape AI video creation because every modern video workflow runs on top of an audio layer.

Here is what they do, why they matter for video, and where Tellers fits in.

What OpenAI Released

Three new Realtime API models, all generally available:

  • GPT-Realtime-2 — voice-in, voice-out conversation with reasoning, tool calls, interruption handling, and a 128k token context window. Priced at $32 per million audio input tokens and $64 per million audio output tokens.
  • GPT-Realtime-Translate — live speech translation from 70+ input languages into 13 output languages, priced at $0.034 per minute.
  • GPT-Realtime-Whisper — streaming speech-to-text that transcribes as the speaker talks, priced at $0.017 per minute.

OpenAI also confirmed that the Realtime API itself is now generally available. For teams that held off building production systems on a beta contract, that is the more important headline.
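For teams planning that move, the shape of the interface matters. The beta Realtime API exchanged JSON events over a WebSocket, and nothing in the announcement suggests that changes at GA. The sketch below opens a session on that assumption; the endpoint, event names, and session fields are carried over from the beta API, and the "gpt-realtime-2" model identifier is a guess based on the names above, so treat it as illustrative rather than as the final SDK surface.

```python
# Minimal sketch, not a verified GA client: open a Realtime session over a
# WebSocket and stream events back. Endpoint, event names, and session fields
# follow the beta Realtime API; "gpt-realtime-2" is an assumed model id.
import asyncio
import json
import os

import aiohttp

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def main() -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # The beta also required an "OpenAI-Beta: realtime=v1" header; GA may not.
    async with aiohttp.ClientSession() as http:
        async with http.ws_connect(REALTIME_URL, headers=headers) as ws:
            # Configure the session: audio in and out, server-side turn detection.
            await ws.send_json({
                "type": "session.update",
                "session": {
                    "modalities": ["text", "audio"],
                    "instructions": "You are a video editing assistant.",
                    "turn_detection": {"type": "server_vad"},
                },
            })
            # A real client would stream microphone audio here with
            # input_audio_buffer.append events before requesting a response.
            await ws.send_json({"type": "response.create"})
            async for msg in ws:
                if msg.type == aiohttp.WSMsgType.TEXT:
                    event = json.loads(msg.data)
                    print(event.get("type"))
                    if event.get("type") == "response.done":
                        break

asyncio.run(main())
```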

Why Audio Models Matter for Video

Most video editing tasks are anchored to an audio timeline. The transcript drives cuts. The voiceover drives B-roll selection. Dialogue drives subtitles. Speech and music together drive pacing. When the audio layer gets faster, cheaper, or more capable, the entire video pipeline benefits.
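To make "the transcript drives cuts" concrete, here is a minimal, self-contained sketch: time-aligned transcript segments go in, a cut list for the timeline comes out. The Segment shape is assumed rather than any particular provider's output format, and the keyword heuristic is a placeholder for the ranking a real editor agent would do with a language model.

```python
# Minimal sketch: turning a time-aligned transcript into a cut list.
# The keyword heuristic is a placeholder; a real editor agent would rank
# segments with a language model rather than by string matching.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds into the source video
    end: float
    text: str

def build_cut_list(segments: list[Segment], keywords: set[str]) -> list[tuple[float, float]]:
    """Keep segments that mention any keyword; merge near-adjacent keeps into one cut."""
    cuts: list[tuple[float, float]] = []
    for seg in segments:
        if not any(k in seg.text.lower() for k in keywords):
            continue
        if cuts and seg.start - cuts[-1][1] < 1.0:   # close gaps shorter than 1s
            cuts[-1] = (cuts[-1][0], seg.end)
        else:
            cuts.append((seg.start, seg.end))
    return cuts

transcript = [
    Segment(0.0, 4.2, "Welcome to the show."),
    Segment(4.2, 11.8, "The launch doubled our signups overnight."),
    Segment(11.8, 16.0, "Anyway, where was I?"),
    Segment(16.0, 24.5, "The launch also cut our support load in half."),
]
print(build_cut_list(transcript, {"launch", "signups"}))
# [(4.2, 11.8), (16.0, 24.5)]
```

Everything downstream, from B-roll selection to subtitles to pacing, keys off the same time-aligned structure.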

A few concrete examples:

  • Transcription is the spine of an automated edit. An editor agent cannot find the strongest moments of an interview without an accurate, time-aligned transcript. Streaming transcription means an agent can begin reasoning about footage before the recording finishes.
  • Translation unlocks multilingual video at scale. A single source video can become localized versions across 13 output languages using a synchronous pipeline, instead of a batch translation workflow that adds hours to delivery.
  • Voice reasoning enables conversational editing. A model that can listen, reason, and respond inside a single audio session is the foundation for hands-free editing — directing a video the way you would brief a colleague, rather than typing a prompt.

These are not hypothetical. They are the workflows behind podcast-to-video tools, multilingual social content engines, automated highlight reels, and accessibility-first video pipelines.
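The streaming case in the first bullet is worth making concrete. In the sketch below, fake_stream is a stand-in for a live GPT-Realtime-Whisper feed, which would actually arrive over a WebSocket as in the earlier sketch; the point is that the agent can start flagging candidate moments while the recording is still running rather than waiting for a finished file.

```python
# Sketch of the streaming idea: act on transcript segments as they arrive,
# instead of waiting for the full recording. fake_stream simulates a live
# transcription feed; the scoring rule is a deliberately naive placeholder.
import asyncio

async def fake_stream():
    """Simulated live transcription: (start, end, text) tuples, in order."""
    for seg in [
        (0.0, 5.0, "Thanks for joining today."),
        (5.0, 12.0, "Our biggest lesson was shipping weekly."),
        (12.0, 18.0, "Questions from the audience?"),
    ]:
        await asyncio.sleep(0.1)   # stands in for real speech time
        yield seg

async def mark_highlights() -> list[tuple[float, float]]:
    highlights: list[tuple[float, float]] = []
    async for start, end, text in fake_stream():
        # A real agent would score each segment with a model; word count is a stub.
        if len(text.split()) >= 5:
            highlights.append((start, end))
            print(f"candidate highlight at {start:.1f}s: {text!r}")
    return highlights

asyncio.run(mark_highlights())
```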

What This Means for AI Video Creation

The video model conversation tends to focus on the visual layer: Veo 4, Seedance 2, Runway Gen-4.5, Kling 3.0. The audio layer has been moving fast in parallel and is just as load-bearing for end-to-end video. Real-time voice with tool use shifts what an AI video editing agent can do during a session, not just at generation time.

Specifically:

  • Faster, cheaper transcription makes automatic edit decisions viable on long-form content — lectures, panels, podcasts, multi-camera interviews.
  • Live translation makes localized publishing realistic in minutes rather than days, often without re-recording voice (a pipeline sketch follows this list).
  • Streaming voice models reduce the latency floor for interactive editing — a request like “tighten the intro” can return a useful response while the rest of the project is still indexing.
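The localization path above is mostly plumbing once translated, time-aligned segments exist. A minimal sketch, with translate() standing in for whichever translation model you actually call and SRT as the subtitle format:

```python
# Sketch of the localization fan-out: one set of timed segments becomes one
# subtitle track per target language. translate() is a placeholder for a call
# to a translation model; the SRT formatting itself is standard.
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

def translate(text: str, lang: str) -> str:
    # Placeholder: call your translation model here.
    return f"[{lang}] {text}"

source = [(0.0, 3.5, "Welcome back."), (3.5, 9.0, "Today we cover the new release.")]
for lang in ["es", "de", "ja"]:
    localized = [(s, e, translate(t, lang)) for s, e, t in source]
    with open(f"subtitles_{lang}.srt", "w", encoding="utf-8") as f:
        f.write(to_srt(localized))
```

Swapping subtitles for a synthesized voiceover changes the last step, not the structure.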

The combination of cheaper audio intelligence and capable video generation is what makes end-to-end AI video editing practical. Neither half is enough on its own.

Where Tellers Fits

Tellers is built around the assumption that audio is a first-class input for video creation. Podcast-to-video, audio-to-video, transcript-driven editing, and chat-based timeline edits all rely on a strong audio layer. As foundation models for transcription and translation get faster and cheaper, the agent can do more in less time, in more languages, at lower cost per output.

We will evaluate the new Realtime audio models for Tellers’ internal pipelines. As always, we will be specific about what is shipped versus what is in evaluation. The relevant Tellers capabilities — audio-driven editing, the agent, and multi-model orchestration — are live today and do not depend on any single provider.

FAQ

What did OpenAI release on May 8, 2026?

OpenAI made its Realtime API generally available and shipped three new audio models alongside it: GPT-Realtime-2 for voice-in/voice-out reasoning with tool use, GPT-Realtime-Translate for live speech translation across 70+ input languages and 13 output languages, and GPT-Realtime-Whisper for streaming speech-to-text.

Why do voice models matter for AI video creation?

Most video edits are anchored to an audio timeline. Transcripts drive cuts, voiceovers drive B-roll selection, dialogue drives subtitles, and pacing follows speech and music. Faster, cheaper, more capable voice models make the rest of the video pipeline faster and cheaper too.

How is GPT-Realtime-Whisper different from the original Whisper?

The original Whisper is a batch speech-to-text model. GPT-Realtime-Whisper is a streaming model that transcribes audio live as the speaker talks, which makes it suitable for real-time agent workflows rather than only post-processing.

Are these new OpenAI models integrated into Tellers today?

The audio-first capabilities on Tellers — chat-based editing, audio-to-video, podcast-to-video, and multi-model orchestration — are already live and do not depend on a single provider. We will evaluate the new Realtime audio models for internal pipelines and share updates if we ship an integration.

How does multilingual AI video work in practice?

Multilingual video typically combines streaming transcription, translation, and either subtitle relayering or voice synthesis over a source recording. Streaming voice models cut latency at each step, which makes same-day localization realistic for short-form content.


If you want to see what an audio-first AI video workflow looks like in practice, open Tellers and start with a podcast, an interview, or a voice memo. The agent will take it from there.