Have you ever wondered how those story videos with voices and pictures are made?
It’s actually not hard. With one story and one AI assistant, you can do it in five steps.
Let’s use The Girl and the Milk Pail and walk through the whole process.
The Big Picture

Five steps: Record → Timestamp → Storyboard → Draw → Compose Video. That’s it.
Making a Video Is Like Making a Sandwich

To make a sandwich, you need bread, meat, and veggies — stacked layer by layer.
Making a video is the same. Stack together audio, images, and a timeline, and the video appears.
Meet Your AI Team

You are the director. Tell your AI team what story you want to make, and they handle the rest:
- Claude (assistant director): your all-purpose helper. It calls Whisper to add timestamps, writes the storyboard, directs Gemini to draw the illustrations, and assembles the final video
- Whisper (transcriber): listens to the recording and notes the second at which each sentence is spoken
- Gemini (artist): draws illustrations based on written descriptions
Let’s Make It!
🎤 Step 1: Read the Story Out Loud and Record It

Pick up The Girl and the Milk Pail, read it out loud like a bedtime story, and record it on your computer.
It doesn’t need to be perfect. Just be natural.
⏱️ Step 2: Add Timestamps

Claude calls Whisper to note when each sentence appears in the recording:
- 0:30 — “milking the cow”
- 0:50 — “daydreaming about new clothes”
- 1:06 — “she fell down”
Now we know exactly which second each sentence is spoken.
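Here is a minimal sketch of what those timestamps look like as data. Whisper reports each piece of speech with a start time, an end time, and the text. The sample list below stands in for that output: the start times come from the example above, but the end times and the helper names (format_timestamp, timestamp_lines) are made up for illustration.

```python
# Hypothetical segments, shaped like the start/end/text entries a
# speech-to-text tool returns. The end times are made up for illustration.
segments = [
    {"start": 30.0, "end": 34.0, "text": "milking the cow"},
    {"start": 50.0, "end": 55.0, "text": "daydreaming about new clothes"},
    {"start": 66.0, "end": 69.0, "text": "she fell down"},
]


def format_timestamp(seconds: float) -> str:
    """Turn a number of seconds into an m:ss label (fractions are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def timestamp_lines(segments: list[dict]) -> list[str]:
    """Build one 'm:ss - text' line per segment."""
    return [f"{format_timestamp(s['start'])} - {s['text']}" for s in segments]


for line in timestamp_lines(segments):
    print(line)
```

Keeping the times as plain seconds makes the later steps easier, because the storyboard and the video both need to do arithmetic with them.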
🎬 Step 3: Write the Storyboard

Every time a new action happens in the story, you need a new picture. Using the timestamps, Claude breaks the story into storyboard scenes:
- Milking the cow on the meadow
- Carrying the milk pail home
- Daydreaming about beautiful new clothes
- Tripping and spilling the milk
For each scene, Claude writes an “instruction sheet” telling the artist what to draw.
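One way to picture the storyboard is as a list of scenes, where each scene lasts until the next one begins. The sketch below assumes that rule and uses the three scenes that have a timestamp in the example. The Scene class, the build_storyboard function, and the 80-second total length are hypothetical, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float      # seconds into the recording
    end: float        # seconds into the recording
    description: str  # what the artist should draw


def build_storyboard(cues: list[tuple[float, str]], total_length: float) -> list[Scene]:
    """Each scene runs from its own cue until the next cue begins."""
    ordered = sorted(cues)
    scenes = []
    for i, (start, description) in enumerate(ordered):
        # The last scene has no successor, so it runs to the end of the audio.
        end = ordered[i + 1][0] if i + 1 < len(ordered) else total_length
        scenes.append(Scene(start, end, description))
    return scenes


# Cue times come from the timestamp step; the 80-second total is made up.
cues = [
    (30.0, "Milking the cow on the meadow"),
    (50.0, "Daydreaming about beautiful new clothes"),
    (66.0, "Tripping and spilling the milk"),
]
storyboard = build_storyboard(cues, total_length=80.0)
for scene in storyboard:
    print(f"{scene.start:.0f}s to {scene.end:.0f}s: {scene.description}")
```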
🎨 Step 4: Draw the Pictures

Claude passes each instruction sheet to Gemini, which creates an illustration in just a few seconds.
For example, for the first scene “milking on the meadow,” the prompt looks like this:
Two farm girls on a sunny green meadow milking a cow. The older girl has long brown hair in braids, the younger girl has short black hair with a red hairband. A large clay milk pot sits between them. Pixar-style 3D animation, vibrant colors, cinematic lighting, 16:9 wide format.
Here’s the trick: always describe the characters the same way in every prompt, such as “the older girl with long brown hair in braids” and “the younger girl with short black hair and a red hairband.” Otherwise, the same character may look different in every picture.
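A simple way to keep the characters consistent is to write their descriptions once and paste the same text into every prompt. The sketch below only builds the prompt text and does not call any image service. The build_prompt helper and the second action line are hypothetical; the character and style wording is copied from the example prompt.

```python
# Fixed character descriptions, reused word for word in every prompt.
CHARACTERS = (
    "The older girl has long brown hair in braids, "
    "the younger girl has short black hair with a red hairband."
)
# Fixed art style, so every picture looks like part of the same video.
STYLE = "Pixar-style 3D animation, vibrant colors, cinematic lighting, 16:9 wide format."


def build_prompt(action: str) -> str:
    """Combine the scene action with the shared character and style text."""
    return f"{action} {CHARACTERS} {STYLE}"


# The first action is from the example prompt. The second is a made-up
# wording for the spilling scene, for illustration only.
actions = [
    "Two farm girls on a sunny green meadow milking a cow.",
    "The two girls on a dirt path as the milk pail tips over and spills.",
]
prompts = [build_prompt(action) for action in actions]
for prompt in prompts:
    print(prompt)
    print()
```

Only the action changes from scene to scene, so the artist gets the same description of the girls every time.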
🧩 Step 5: Compose the Video

Finally, Claude lines up three tracks: audio, images, and subtitles. When the narration reaches “she fell down” at 1:06, the picture switches to the falling scene.

Once everything is aligned, press export, and the video is done!
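The lining-up part boils down to one question: at any given second, which picture should be on screen? Here is a minimal sketch of that lookup, assuming each picture stays up until the next one starts. The image file names and the image_at function are hypothetical, and a real video tool would do this for you.

```python
from bisect import bisect_right

# (start time in seconds, image file), sorted by start time.
# The file names are hypothetical.
timeline = [
    (30.0, "scene_milking.png"),
    (50.0, "scene_daydreaming.png"),
    (66.0, "scene_falling.png"),
]


def image_at(timeline: list[tuple[float, str]], t: float):
    """Return the image on screen at time t, or None before the first scene."""
    starts = [start for start, _ in timeline]
    # Find the last scene whose start time is at or before t.
    index = bisect_right(starts, t) - 1
    return timeline[index][1] if index >= 0 else None


for t in (10.0, 30.0, 55.0, 66.0):
    print(t, image_at(timeline, t))
```

At 66 seconds (1:06) the lookup switches to the falling picture, which matches the moment the narration says “she fell down.”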
Want to Try It?
- Pick a short story (1–2 minutes is enough)
- Read it out loud and record it
- Follow the five steps above
It’s okay if it doesn’t come out perfectly the first time. Try a few times and you’ll get better. Just start!