▶ Video summary · Rendered with HyperFrames · Andrew Neural narration · 44s

HyperFrames: My Blog Just Got a Video Version

A new Nous Research skill turns ordinary HTML into rendered MP4 — and the first thing I shipped with it is the post you're reading.

May 2026 • AI Tooling • 8 min read

🎬 The Short Version

HyperFrames is a new optional skill in NousResearch/hermes-agent that captures HTML/CSS/GSAP compositions to MP4. Authors design in HTML, define a timeline, and the framework hands back a polished video. I installed it, defined a visual identity, wrote a thirty-second narration, generated a few AI images as backdrops, and rendered the summary video at the top of this page. Total wall-clock time on a fresh project: under fifteen minutes.

What HyperFrames Actually Is

HyperFrames lives at NousResearch/hermes-agent/optional-skills/creative/hyperframes on GitHub. The pitch is genuinely cool: instead of fighting After Effects timelines or wrestling with Remotion's React-only world, you author videos as plain HTML documents. CSS handles layout. GSAP handles animation. A small set of conventions (data-start, data-duration, data-track-index, a registered window.__timelines entry) tells the renderer what to capture.

Then a CLI walks the timeline frame-by-frame in headless Chrome and stitches the result with FFmpeg. Output: standard H.264/AAC MP4. Plays everywhere.
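To make those conventions concrete, here is a minimal sketch of a composition, written out by a small shell function. The attribute names and the window.__timelines registration are the ones described above; the composition id "blog-summary", the 1920x1080 size, the six-second timings, the class names, and the local assets/gsap.min.js path are my own placeholders, not HyperFrames defaults.

```shell
#!/usr/bin/env bash
# Sketch only: writes a minimal composition using the attributes named in this post.
# The id, size, timings, class names, and GSAP path are illustrative assumptions.
write_composition() {
  local out="${1:-index.html}"
  cat > "$out" <<'HTML'
<!doctype html>
<html>
  <body>
    <!-- Root: id, composition id, size, start, and duration -->
    <div id="root" data-composition-id="blog-summary"
         data-width="1920" data-height="1080"
         data-start="0" data-duration="6">
      <!-- Two scenes on the same track, back to back -->
      <div class="scene clip" data-start="0" data-duration="3" data-track-index="0">
        <h1 class="title">HyperFrames</h1>
      </div>
      <div class="scene clip" data-start="3" data-duration="3" data-track-index="0">
        <p class="tagline">Video as HTML</p>
      </div>
    </div>
    <script src="assets/gsap.min.js"></script>
    <script>
      // One paused timeline, registered under the composition id
      const tl = gsap.timeline({ paused: true });
      tl.from(".title", { opacity: 0, y: 40, duration: 0.6 }, 0);
      tl.from(".tagline", { opacity: 0, duration: 0.6 }, 3);
      window.__timelines = window.__timelines || {};
      window.__timelines["blog-summary"] = tl;
    </script>
  </body>
</html>
HTML
}
```

Calling write_composition index.html drops the file in the current directory. Treat it as a starting point to compare against the scaffolded example, not as the canonical template.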

What makes it interesting for an AI-agent workflow:

Deterministic by design. No Math.random(), no Date.now(), no network fetches in compositions — the framework lints for it. Every render is reproducible.
Local pipeline. Text-to-speech narration, Whisper-style transcription, captions, transitions, render. Nothing ships off the box unless you ask it to.
Lintable. npm run check catches GSAP/CSS transform conflicts, missing classes, contrast issues, and layout overflows before you waste minutes on a render.
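The determinism rule is simple enough to approximate by hand. The function below is my own rough pre-check, not the HyperFrames linter: it greps a composition file for the two calls named above plus fetch( as a stand-in for network access. The function name is hypothetical, and the pattern is deliberately naive (it would also flag something like prefetch().

```shell
#!/usr/bin/env bash
# Illustrative pre-check, not the HyperFrames linter: flag Math.random(),
# Date.now(), and fetch() calls in a composition file.
scan_nondeterminism() {
  local file="$1"
  # -n prints line numbers, -E enables alternation; grep exits 0 when it finds a match
  if grep -nE 'Math\.random\(|Date\.now\(|fetch\(' "$file"; then
    echo "non-deterministic call found in $file" >&2
    return 1
  fi
  return 0
}

# Usage: scan_nondeterminism index.html && echo "looks deterministic"
```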

How I Installed It

The skill is in the optional-skills/ tree, so it isn't loaded by default. The whole install on the VPS was three commands and a Node version check.

# Make sure Node 20+ is on PATH
node --version  # I'm on v22.22.2

# Pull and add the skill (it ships with hermes-agent)
cd ~/.hermes/hermes-agent
git pull

# Add audio prerequisites for local TTS + transcription
pip install kokoro-onnx soundfile

The CLI itself runs via npx hyperframes — no global install needed. From a project directory you get npm run dev, npm run check, npm run render, and npm run publish. Everything else is just authoring HTML.
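If you want the Node check to fail loudly in a setup script instead of eyeballing it, something like this works. It is a small helper of my own (the name node_major_ok is made up), and it only compares the major version against the 20+ requirement.

```shell
#!/usr/bin/env bash
# Returns 0 when a version string like "v22.22.2" has a major version >= 20.
node_major_ok() {
  local version="${1#v}"        # strip the leading "v"
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge 20 ]
}

# Usage: node_major_ok "$(node --version)" || echo "Node 20+ required" >&2
```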

Bootstrapping a Project

One command scaffolds a project:

mkdir -p ~/hyperframes-projects/hyperframes-blog-summary
cd ~/hyperframes-projects
npx hyperframes init hyperframes-blog-summary --non-interactive --example product-promo

That gave me a working index.html root composition, a compositions/ directory for sub-comps, an assets/ folder, and a meta.json. The example was fine to study — and easy to delete once I had my own scenes drafted.

The Authoring Loop

Every project follows the same shape:

Write a DESIGN.md that locks the palette, fonts, and tone. The agent reads this before drafting compositions, which keeps brand drift out of the output.
Write a SCRIPT.md — narration text in plain English, sized to the runtime you want. ~120 words gives you a comfortable thirty-second video.
Generate narration. I used edge-tts with en-US-AndrewNeural for voice consistency with the existing book reviews on the site. The script writes both an MP3 and an SRT subtitle file in one pass.
Build the composition. Each scene is a <div class="scene clip"> with data-start and data-duration. A single GSAP timeline orchestrates everything, paused, registered on window.__timelines under the composition's id.
Lint. npm run check validates structure, runs the page in headless Chrome, samples nine timeline points for layout/contrast issues, and flags any timing math that doesn't add up.
Render. npm run render -- --quality high emits a 1080p MP4 in the renders/ folder.
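The narration, lint, and render steps can be wrapped in two small shell functions. This is a sketch under a few assumptions: the file names SCRIPT.txt, narration.mp3, and narration.srt are placeholders (I am assuming a plain-text copy of the script so no markdown syntax gets read aloud), and the subtitle format edge-tts writes may depend on the installed version. The voice name and the npm scripts are the ones mentioned in this post.

```shell
#!/usr/bin/env bash
# Sketch of the narration, lint, and render steps. File names are assumptions.
narrate() {
  # edge-tts reads the plain-text script and writes audio plus subtitles in one pass
  edge-tts --voice en-US-AndrewNeural \
    --file "${1:-SCRIPT.txt}" \
    --write-media narration.mp3 \
    --write-subtitles narration.srt
}

check_and_render() {
  local quality="${1:-draft}"
  # Stop before rendering if the lint fails
  npm run check || return 1
  npm run render -- --quality "$quality"
}

# Usage: narrate SCRIPT.txt && check_and_render draft
```

Defaulting to draft quality keeps the loop fast; pass high as the argument once the scenes look right.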

⚠️ Pitfalls I Hit

Image dimensions. Venice's image API caps width at 1280, so requesting 1920×1080 returns a 400. Generate at 1280×720 (or 1024 square) and let CSS background-size: cover fill the frame.

Composition root. The root element needs id="root", data-composition-id, data-width, data-height, and data-start="0" with data-duration. Missing the duration attribute makes the inspector blow up with a cryptic totalDuration error.

Timeline key must match composition id. If your composition is data-composition-id="my-promo", your timeline must register as window.__timelines["my-promo"] — not "root". The lint catches this; trust it.
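For a quick sanity check outside the linter, the comparison can be sketched with sed. This helper is my own illustration, not part of HyperFrames: it assumes the id and the registration each appear as a double-quoted literal on a single line, which will not hold for every project.

```shell
#!/usr/bin/env bash
# Illustrative check, not part of HyperFrames: compare the composition id in the
# markup with the key used to register the timeline.
timeline_key_matches() {
  local file="$1" comp_id key
  # First data-composition-id="..." value in the file
  comp_id="$(sed -n 's/.*data-composition-id="\([^"]*\)".*/\1/p' "$file" | head -n 1)"
  # First window.__timelines["..."] key in the file
  key="$(sed -n 's/.*window\.__timelines\["\([^"]*\)"].*/\1/p' "$file" | head -n 1)"
  if [ -z "$comp_id" ] || [ "$comp_id" != "$key" ]; then
    echo "composition id '$comp_id' vs timeline key '$key'" >&2
    return 1
  fi
}

# Usage: timeline_key_matches index.html || echo "fix the timeline key" >&2
```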

What I Actually Used It For

The first thing I rendered was a thirty-second promo for my book review of The Four Agreements. The whole pipeline — DESIGN.md, SCRIPT.md, TTS, captions, animation, render, deploy — took one focused session. The result lives on a dedicated page where you can watch the promo with all the site styling and a link back to the full review.

That gave me enough confidence to use it for this very blog post. The video at the top of this article is itself a HyperFrames render — same pipeline, same Andrew Neural voice, same riverside palette. The composition source lives in ~/hyperframes-projects/hyperframes-blog-summary/index.html, the AI backdrop images came from the venice-ai-media skill, and the whole thing rendered in a couple of minutes at --quality high.

Why This Matters for AI-Driven Sites

Most AI-built websites are walls of text with the occasional generated image. That's fine, but it's also flat. HyperFrames is the first tool I've used where an agent can author a real video as code (reviewable, versionable, lintable) without ever opening a non-deterministic editor. The output isn't a slideshow with Ken Burns transitions; it's actual motion graphics, with synced narration and captions, that an agent can iterate on the way it iterates on a CSS file.

Two practical things I'm now planning:

A promo for every book review. One thirty-second highlight per review, embedded on the gallery page. The Four Agreements is up; the rest are next.
Video summaries for long blog posts. Like this one. If a post is over 1,500 words and the topic benefits from motion (architectures, before/after comparisons, anything with a timeline), it gets a HyperFrames intro at the top.

Honest Caveats

It's not magical. A few things to keep in mind:

Author-time still matters. The agent still has to actually compose good frames. Lazy scene design produces lazy video — same as lazy CSS produces ugly pages.
Render is single-threaded headless Chrome. A forty-four-second 1080p render at high quality takes a few minutes on a modest VPS. Iterating in --quality draft is the right move; only switch to high when you're happy.
Captions are coarse without word-level timing. edge-tts emits SRT phrases, not per-word timestamps. If you want karaoke-style captions, route the audio through Whisper first to get a word-level transcript.json.

None of those are dealbreakers. They're the normal trade-offs you'd expect from a tool that's letting you author videos as HTML — and they're well documented in the skill itself.

Try It

If you have a Hermes setup already, the skill is one git pull away. If you don't, the GitHub page has the source: github.com/NousResearch/hermes-agent. Worth the half-hour to bootstrap a project and render your first thing.

I'll be using it a lot.