a16z published a piece this week declaring “2026 is the year we let agents edit it.” The timing is not a coincidence. The Model Context Protocol now has more than 10,000 enterprise servers and 97 million monthly SDK downloads. Google, OpenAI, Microsoft, and AWS have all adopted it natively. Agents are not a future runtime — they are the current one.
The problem is that video tools were not built for them.
Most platforms designed for human editors carry assumptions that break under agentic use: binary formats with no queryable structure, implicit context that lives in the user’s head, expensive analysis pipelines that run from scratch on every request, and preview workflows that require a full re-encode before an agent can evaluate what it just did. Hand a standard video API to an agent and you will spend most of your token budget on overhead.
At Tellers, we have been rebuilding around the inverse assumption: agents are the primary client. Here is what that stack looks like.
A Purpose-Built MCP Server, Not an API Wrapper
The fastest shortcut is to duplicate a REST API as an MCP server. We did not do that.
The Tellers MCP server exposes the internal primitives of the Tellers platform — the same low-level building blocks that power the editor — not a human-facing API wrapped in a protocol adapter. The distinction matters because primitives compose. An agent can express exactly the operation it wants without navigating abstractions that exist only for human convenience.
The interface is also designed around what LLMs need from a tool response: structured output, predictable schemas, minimal noise. We stripped any visual chrome or explanatory text that a human would find helpful but that wastes tokens when the consumer is a model.
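To make "structured output, minimal noise" concrete, here is a rough sketch of the response shape we are describing. The tool names and fields below are illustrative placeholders, not the actual Tellers schema.

```python
# Illustrative only: tool names and fields are placeholders,
# not the real Tellers MCP schema.

insert_result = {
    "tool": "timeline.insert_clip",   # one primitive, one operation
    "clip_id": "clip_7f3a",
    "track": 2,
    "start_ms": 12_400,
    "duration_ms": 3_200,
    "warnings": [],                   # machine-readable, no prose for humans
}

# Primitives compose: the output of one call feeds directly into the next,
# with no human-facing abstraction layer in between.
next_call = {
    "tool": "timeline.apply_transition",
    "clip_id": insert_result["clip_id"],
    "type": "crossfade",
    "duration_ms": 500,
}
```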
Hosted MCP apps with visual previews for Claude, ChatGPT, Codex, Cowork, and Gemini are coming soon: no local process required, and they will be listed in the major MCP directories.
An LLM-Optimised CLI
The Tellers CLI is installable via Homebrew. It is designed to be called from scripts and agent-driven pipelines, not just used interactively.
Output is structured for LLM consumption by default: clean JSON, consistent field names, no decorative formatting that a parser has to strip. When an agent calls the CLI, it gets back data it can act on — not a table formatted for a terminal.
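As a sketch of what that enables (the subcommand and flag below are hypothetical, not the documented CLI surface), an agent-side script can shell out to the CLI and act on the output directly:

```python
import json
import subprocess

# Hypothetical invocation: the "projects list" subcommand and "--json"
# flag stand in for whatever the real CLI exposes.
result = subprocess.run(
    ["tellers", "projects", "list", "--json"],
    capture_output=True,
    text=True,
    check=True,
)

# Clean JSON with stable field names: nothing to strip, nothing to guess.
projects = json.loads(result.stdout)
for project in projects:
    print(project["id"], project["name"])
```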
A Video DSL That Respects Token Budgets
We built a custom video edit DSL — a domain-specific language for timeline operations:
- Compact: multi-step edits are expressed in a fraction of the tokens that equivalent diffs or tool calls would require
- Deterministic: each expression maps to exactly one operation, no interpretation required
- Fast to parse: the runtime resolves DSL instructions directly, without a secondary LLM call, and also sanitises the timeline (for example, ensuring a clip never requests a longer duration than its source video provides)
- Cheap to run: the DSL is optimised for token efficiency, which keeps each edit fast and inexpensive to execute
It is a language purpose-designed for the latency and cost constraints of agentic loops.
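As an illustration of the trade-off (the DSL syntax below is invented for this example and does not match the real Tellers grammar), compare a compact edit string with the equivalent structured tool calls:

```python
import json

# Invented syntax, for illustration only: the real Tellers DSL differs.
dsl_edit = "cut clip_3 @ 14.2s; trim clip_4 end -1.5s; xfade clip_3 clip_4 0.5s"

# The same three operations expressed as conventional tool calls.
equivalent_tool_calls = [
    {"tool": "timeline.cut", "clip_id": "clip_3", "at_ms": 14_200},
    {"tool": "timeline.trim", "clip_id": "clip_4", "edge": "end", "delta_ms": -1_500},
    {"tool": "timeline.transition", "from": "clip_3", "to": "clip_4",
     "type": "crossfade", "duration_ms": 500},
]

# A crude proxy for token cost: the DSL string is a fraction of the length
# of the serialised tool-call payload the model would otherwise emit.
print(len(dsl_edit), len(json.dumps(equivalent_tool_calls)))
```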
Video Memory: Index Once, Query Forever
The structural problem with bare-bones AI agents doing video work is that every request starts from scratch. To find all moments in a 45-minute interview where the speaker hesitates, an agent needs to transcribe the audio, parse the transcript, and correlate it with the timeline — on every single call.
Tellers processes video once and stores the results:
- Transcripts with word-level timestamps
- Scene boundaries detected from the source material
- Audio beats for music-synchronised edits
- Embeddings for semantic search across a video library
An agent querying “find every moment the speaker says ‘actually’” against a pre-indexed Tellers project gets results in milliseconds. The analysis pipeline ran once. The cost is fixed, not per-request.
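A minimal sketch of why this is cheap, assuming a stored word-level transcript (the data structures here are hypothetical, not the Tellers storage format):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int

def find_word(transcript: list[Word], target: str) -> list[Word]:
    """Return every word-level hit, already aligned to the timeline."""
    target = target.lower()
    return [w for w in transcript if w.text.lower().strip(".,!?") == target]

# Transcription ran once at ingest, so this is a lookup over stored data,
# not a fresh transcription job on a 45-minute file.
transcript = [Word("Well,", 0, 400), Word("actually", 410, 900), Word("we", 910, 1000)]
for hit in find_word(transcript, "actually"):
    print(f"{hit.start_ms} ms to {hit.end_ms} ms")
```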
Real-Time Preview Without Re-Encoding
The agent feedback loop for video editing has a hidden tax: if every edit requires a full re-encode to preview, iteration is slow and the agent cannot validate its work mid-flight.
Tellers has an in-house video player that renders edit state in real time without re-encoding. An agent can display the current state of a project immediately after issuing an edit — not after waiting for a render job. In MCP applications, this means the user sees a live preview of every edit as the agent makes it.
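A sketch of the agent-side loop this unlocks (the project object and its methods are hypothetical; the point is that preview is a cheap read of edit state, not a render job the loop blocks on):

```python
# Hypothetical interfaces: `project.apply`, `project.preview`, and
# `project.undo` stand in for whatever surface the agent actually uses.
def agentic_edit_loop(project, edits, evaluate):
    for edit in edits:
        project.apply(edit)                # issue the edit
        preview = project.preview()        # live player state, no re-encode
        if not evaluate(preview):          # validate the work mid-flight
            project.undo()                 # cheap to back out and retry
    return project
```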
Cloud-Based, Available Everywhere
Local video editing requires a powerful machine. The Tellers cloud-based model removes that constraint entirely: a Claude artifact, a Codex sandbox, a Gemini workspace — any agent in any environment can issue API or MCP calls and get back a rendered video URL. No GPU required on the client side.
State persists across agent sessions. A video indexed on Monday is still queryable on Friday. Projects accumulate context, potentially spanning your entire video library and archives, so agents do not start from scratch.
Full Primitive Access Via MCP
The Tellers MCP server exposes all the internal primitives — not a curated subset. Agents have the same control over the timeline, generation parameters, asset management, and rendering that a human user has in the editor. The goal is that anything a developer can do in the Tellers UI, an agent can do programmatically.
Aggregating Best-in-Class Generative Models Without Extra Subscriptions or Payments
Tellers acts as a unified orchestration layer across the best generative models for video, audio, and images. Instead of juggling multiple tools, APIs, and subscriptions, agents can access everything through a single interface.
Under the hood, Tellers routes each task to the most relevant model based on quality, speed, and cost. This means an agent can seamlessly combine different providers within the same workflow — generating footage, voiceovers, music, or effects — without the user needing to manage integrations or billing across multiple platforms.
The result is a simpler and more efficient setup: one system, one billing layer, and full access to a constantly evolving ecosystem of state-of-the-art models.
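A simplified sketch of what routing on those three axes can look like. The model names, scores, and weighting scheme are made up; the actual routing logic is internal to Tellers.

```python
# Scores are normalised so that higher is better on every axis
# ("cost" here means cost efficiency, i.e. cheaper scores higher).
CANDIDATES = [
    {"name": "video-model-a", "quality": 0.9, "speed": 0.4, "cost": 0.2},
    {"name": "video-model-b", "quality": 0.7, "speed": 0.9, "cost": 0.8},
]

def route(weights: dict[str, float]) -> str:
    """Pick the model whose profile best matches the task's priorities."""
    def score(model: dict) -> float:
        return sum(weights[axis] * model[axis] for axis in ("quality", "speed", "cost"))
    return max(CANDIDATES, key=score)["name"]

# A quick draft favours speed and cost; a final render favours quality.
print(route({"quality": 0.2, "speed": 0.5, "cost": 0.3}))  # video-model-b
print(route({"quality": 0.8, "speed": 0.1, "cost": 0.1}))  # video-model-a
```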
(Coming Soon) Budgets and Fine-Grained Permissions to Safely Unleash Your Agents
Giving agents full control is powerful — but it needs guardrails.
Tellers introduces fine-grained controls that let you define exactly how your agents can operate. You can set maximum budgets, restrict access to specific models, and limit which video libraries or folders can be used.
This allows you to safely “let loose” your agents while staying in control of costs, data access, and outputs. Whether you’re running large-scale automated workflows or experimenting with new use cases, these constraints ensure predictable behaviour and prevent unexpected usage.
In practice, it turns agents from experimental tools into reliable production systems.
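To give a sense of the shape these controls might take (field names are illustrative, and the feature itself is still marked coming soon), a policy could look roughly like this:

```python
# Illustrative policy; the real control surface is not finalised yet.
agent_policy = {
    "max_spend_usd_per_day": 25.0,           # hard budget ceiling
    "allowed_models": ["video-model-a"],      # restrict generation backends
    "allowed_libraries": ["marketing-2024"],  # limit readable folders
}

def allowed(policy: dict, request: dict, spent_today_usd: float) -> bool:
    """Reject any request that would step outside the agent's guardrails."""
    if spent_today_usd + request["estimated_cost_usd"] > policy["max_spend_usd_per_day"]:
        return False
    if request["model"] not in policy["allowed_models"]:
        return False
    return request["library"] in policy["allowed_libraries"]
```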
The infrastructure bet we are making is that capable models are not the bottleneck for agentic video workflows. The bottleneck is tooling: do agents have the right primitives, the right memory layer, the right output formats, the right feedback loops?
That is the stack we are building.
What is the Tellers MCP server?
The Tellers MCP server exposes the internal primitives of the Tellers platform to AI agents via the Model Context Protocol. It is purpose-built for agent use — not a duplication of the REST API — with outputs optimised for LLM consumption.
What is the Tellers video edit DSL?
The Tellers video DSL is a domain-specific language for expressing timeline edits. It is compact, deterministic, and fast to parse — designed to minimise token consumption when agents issue multi-step edit instructions.
How does video memory work in Tellers?
Tellers indexes video content once and stores the results: transcripts, scene boundaries, audio beats, and semantic embeddings. Agents query this index on subsequent requests instead of re-running expensive analysis pipelines every time.
Is the Tellers CLI available?
Yes. The Tellers CLI is installable via Homebrew. It produces structured, LLM-optimised output — clean JSON, predictable schemas — so it integrates cleanly into scripted and agent-driven workflows.
Do agents need to re-encode video to see a preview?
No. Tellers has an in-house video player that renders edit state in real time without re-encoding. An agent can display a live preview of its current edits immediately, without waiting for a render job.
Which AI assistants will support the Tellers MCP apps?
Tellers is releasing native MCP applications for Claude, ChatGPT, Codex, Cowork, and Gemini. These are coming soon — the beta MCP server is available today on GitHub.
To try the Tellers MCP server today, the beta is on GitHub. To start building with the API directly, open the Tellers app. The hosted MCP apps for Claude, ChatGPT, and others are coming shortly.