How I Built AI Picture Book

Published June 15, 2026

Type one idea → MiniCPM writes the whole story → FLUX paints every page → read it in an animated flip-book, download a PDF, or share it to a global library.

I built AI Picture Book 📖 for the Gradio × Hugging Face Build Small Hackathon — a little book studio that lives inside one Hugging Face Space. You type a one-line idea — "a shy dragon who opens a tiny tea shop," "a kitten ninja saves the noodle festival" — pick a look, and a pipeline writes a complete, illustrated story: a children's picture book, or a manga/comic with real panels and speech bubbles. Then you read it page by page in an animated reader, download a print-ready PDF, or publish it to a community library for anyone to read.

One idea organized everything: the words and the pictures are AI-generated and different every time, but the reader, the PDF, and the gallery are a fixed, hand-built front end that works for any book. That constraint is what turns an arbitrary sentence into a finished, shareable object.

▶️ Watch the demo on YouTube — 45 seconds: a kids' bedtime picture book, "bring your own hero," and an action comic, then publish to the library.

Architecture

A few small models behind two routes: MiniCPM4.1-8B writes the story, MiniCPM-V reads an uploaded hero, FLUX.2-klein-9B paints the pages on ZeroGPU, and Nemotron-3.5-Content-Safety screens everything from a Modal GPU. A Hugging Face Bucket holds the shared library; the reader, PDF and gallery are vanilla JS with zero frameworks.

Writing a whole book, page by page

The first version was the obvious one: a single MiniCPM call that returned the entire book as one JSON pack — title, cast, every page's text and scene. It worked for a 4-page picture book and fell apart on a 6-page comic with 4 panels each. The JSON would run past the model's output budget, truncate mid-page, fail to parse, and drop into a fallback that stamped the book's own title onto every page. The art rendered; the words never did.

The fix was to stop cramming. Generation is now one small "plan" call plus a fan-out of small per-page calls (prompt_pipeline.py). The plan returns just the skeleton — title, a 1–3 character cast (each with a fixed visual look written once), a cover scene, and exactly N one-line beats, one per page, forming a continuous arc. Then each page is expanded by its own call, run in parallel with a ThreadPoolExecutor. Every page call gets the whole outline and is told to write the marked page so it flows from the previous one — same names, same setting, clear cause and effect — which is what killed the "broken, disjointed" feeling and the name drift (Seraphine → Seraphia → Seranine) that plagued the monolithic version. Small responses never truncate, and a comic page reliably comes back with exactly the requested panel count, because the few-shot example is sliced to match.
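The shape of that plan-then-fan-out flow can be sketched as below. This is a minimal illustration, not the actual prompt_pipeline.py: the function names, the prompt wording, the JSON keys, and the `fake_llm` stand-in are all assumptions made so the example runs offline.

```python
import json
import re
from concurrent.futures import ThreadPoolExecutor


def plan_book(llm, idea, n_pages):
    """One small call that returns only the skeleton of the book."""
    prompt = (
        f"PLAN\nidea: {idea}\npages: {n_pages}\n"
        "Return JSON with title, cast, cover and exactly one beat per page."
    )
    plan = json.loads(llm(prompt))
    if len(plan["beats"]) != n_pages:
        raise ValueError("plan must contain exactly one beat per page")
    return plan


def write_page(llm, plan, index):
    """Expand one beat; the full outline is included so the page fits the arc."""
    outline = "\n".join(
        f"{'>>' if i == index else '  '} {i + 1}. {beat}"
        for i, beat in enumerate(plan["beats"])
    )
    prompt = (
        f"PAGE\ntitle: {plan['title']}\noutline:\n{outline}\n"
        "Write only the page marked >> so it flows from the previous one."
    )
    return json.loads(llm(prompt))


def write_book(llm, idea, n_pages, max_workers=4):
    plan = plan_book(llm, idea, n_pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so pages stay in sequence.
        pages = list(pool.map(lambda i: write_page(llm, plan, i), range(n_pages)))
    return {"title": plan["title"], "cast": plan["cast"], "pages": pages}


def fake_llm(prompt):
    """Hypothetical stand-in model; a real MiniCPM client would go here."""
    if prompt.startswith("PLAN"):
        n = int(re.search(r"pages: (\d+)", prompt).group(1))
        return json.dumps({
            "title": "The Tiny Tea Shop",
            "cast": [{"name": "Ember", "look": "small green dragon, red scarf"}],
            "cover": "Ember outside a tiny tea shop",
            "beats": [f"beat {i + 1}" for i in range(n)],
        })
    page_no = int(re.search(r">> (\d+)\.", prompt).group(1))
    return json.dumps({"page": page_no, "text": f"Text for page {page_no}"})


book = write_book(fake_llm, "a shy dragon who opens a tiny tea shop", n_pages=4)
print([page["page"] for page in book["pages"]])
```

Each per-page response stays small because it carries one page only, while the outline in the prompt gives every call the same names and story arc to work from.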

A nice side effect: I stopped asking the model to restate each character's full description in every scene (the biggest token hog). Scenes just name the cast; the app re-injects the locked look into the FLUX prompt at build time. Shorter JSON, identical consistency.
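Re-injecting the locked look might look roughly like this. It is a sketch under assumptions: `build_image_prompt`, the cast dictionary keys, and the exact prompt layout are hypothetical, and the real app may match names and order the prompt differently.

```python
import re


def build_image_prompt(scene, cast, art_style, tail=""):
    """Join a scene with the locked look of each cast member it names."""
    looks = [
        f"{member['name']}: {member['look']}"
        for member in cast
        # Whole-word, case-insensitive match so "Al" does not match "also".
        if re.search(rf"\b{re.escape(member['name'])}\b", scene, re.IGNORECASE)
    ]
    parts = [scene.strip().rstrip(".")]
    if looks:
        parts.append("Characters: " + "; ".join(looks))
    parts.append(art_style)
    if tail:
        parts.append(tail)
    return ". ".join(parts)


cast = [
    {"name": "Ember", "look": "small green dragon, red scarf"},
    {"name": "Pip", "look": "grey mouse in a blue apron"},
]
prompt = build_image_prompt(
    "Ember pours tea for the first customer.",
    cast,
    art_style="soft watercolor picture book",
    tail="no text, full-bleed",
)
print(prompt)
```

The model only has to write the short scene; the description it would otherwise repeat on every page is stored once and appended by code.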

Painting every page with FLUX.2-klein-9B

FLUX.2-klein is distilled — four steps, guidance 1.0 — so a single GPU attach paints the whole book: the cover plus up to eight pages in one gen_images call (flux_local.py). Pages are A4 portrait by default (896×1264), square or landscape on request. The same art_style tail rides every prompt so the book stays visually cohesive, and a "no text, full-bleed" tail keeps picture-book illustrations clean (the caption is rendered crisply by the app, not baked by the model). Locally, with no GPU or torch, flux_local.AVAILABLE is False and the app falls back to soft placeholder pages so the reader still works.
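The availability flag and placeholder fallback can be sketched as follows. This is not the real flux_local.py: `paint_book` and the `painter` callable are hypothetical, the square and landscape sizes are assumed values, and only the portrait size, the four steps, and guidance 1.0 come from the description above.

```python
import importlib.util

# True only when torch can be imported; on a laptop without it the app
# would take the placeholder path below.
AVAILABLE = importlib.util.find_spec("torch") is not None

PAGE_SIZES = {
    "portrait": (896, 1264),   # stated in the article
    "square": (1024, 1024),    # assumed for illustration
    "landscape": (1264, 896),  # assumed for illustration
}


def paint_book(prompts, shape="portrait", painter=None):
    """Paint the cover and all pages in one batched call, or fall back."""
    width, height = PAGE_SIZES[shape]
    if painter is None:
        # No GPU pipeline available: return soft placeholders so the reader works.
        return [
            {"placeholder": True, "width": width, "height": height, "prompt": p}
            for p in prompts
        ]
    # One call for the whole book, so the GPU is attached a single time.
    return painter(prompts, width=width, height=height, steps=4, guidance=1.0)


pages = paint_book(["cover", "page 1", "page 2"])
print(len(pages), pages[0]["width"], pages[0]["height"])
```

Passing the painter in as an argument keeps the rest of the app identical whether or not a GPU is present.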

Keeping a character on-model — and bringing your own

A cast the story model invents has no reference image to condition on, so by default consistency is bought with words: the cast's fixed look is restated in every page prompt, and the artist draws the same character each time. Then I added the feature I like most: upload a hero. A child can photograph their own drawing; MiniCPM-V (the vision sibling, same key) describes it in one vivid line, that description is screened by the safety gate like any other text, and it becomes the hero's locked look across the book. Because klein also accepts an image= reference, the photo rides along as a visual anchor (the same trick the runner uses for its player sprite). Bedtime stories for the little ones, their own drawing as the star, or your favorite character: same pipeline.

Comics: the bubble problem

This was the saga. klein can render short text — "DASH!!", "Gotcha!" — but ask it to bake a full sentence into a speech bubble and you get beautiful art wrapped around gibberish: "KAITO WAS LECIIEDM BLH NIHIS PRIASE." I tried everything to coax it — one short bubble per panel, no narration boxes, "render text clearly, correctly spelled." It got better, never reliable.

So I moved the text off the model entirely. For comics, FLUX now draws clean panel art with no text at all, leaving the top of each panel open. The app holds the dialogue (MiniCPM wrote it, tagged per panel) and draws the speech bubbles itself — crisp canvas text, rounded bubble, a little tail, an SFX word — positioned into each panel using a PANEL_GRID that matches the layout the model was told to draw. And the key decision: I bake it once. Right after generation, a canvas composites the bubbles onto each page image a single time; from then on the reader, the PDF, the shared copy, and the My-Books shelf all show that one annotated image. No live overlay to mis-scale, no re-drawing in the PDF, perfect consistency everywhere. Picture-book captions follow the same philosophy — rendered crisply by the app, never trusted to the diffusion model.
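The geometry behind that placement can be sketched as below. The real app draws on a browser canvas; this Python version only computes where each bubble would go. The `PANEL_GRID` layouts, the margin, and the height of the band kept clear at the top of each panel are all assumed values, not the app's actual numbers.

```python
# Each layout lists panels as (x, y, width, height) fractions of the page.
# These layouts are assumptions for illustration.
PANEL_GRID = {
    1: [(0.0, 0.0, 1.0, 1.0)],
    2: [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 1.0, 0.5)],
    3: [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 0.5, 0.5), (0.5, 0.5, 0.5, 0.5)],
    4: [(0.0, 0.0, 0.5, 0.5), (0.5, 0.0, 0.5, 0.5),
        (0.0, 0.5, 0.5, 0.5), (0.5, 0.5, 0.5, 0.5)],
}


def bubble_boxes(page_w, page_h, n_panels, margin=0.06, band=0.28):
    """Return one (x, y, w, h) pixel box per panel, in the panel's open top band."""
    boxes = []
    for fx, fy, fw, fh in PANEL_GRID[n_panels]:
        px, py = fx * page_w, fy * page_h
        pw, ph = fw * page_w, fh * page_h
        boxes.append((
            round(px + margin * pw),       # inset from the panel's left edge
            round(py + margin * ph),       # inset from the panel's top edge
            round(pw * (1 - 2 * margin)),  # leave the same inset on the right
            round(ph * band),              # only the top band of the panel
        ))
    return boxes


for box in bubble_boxes(896, 1264, n_panels=4):
    print(box)
```

Because the grid used for placement is the same one the image prompt describes, the bubbles land inside the panels without inspecting the generated art.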

The reader, the PDF, and one source of truth

The reader is a hand-built flip-book — an animated page-turn, fullscreen, swipe and keyboard nav — and the PDF is built client-side with jsPDF straight from the images already in the browser: instant, offline, no re-upload, no extra Python dependency. Because comic text is baked into the image and picture-book captions are drawn from the same data the reader uses, "what you read" and "what you download" are always identical.

Content-safety gates hosted on Modal

People type anything, so NVIDIA Nemotron-3.5-Content-Safety (a public Gemma-3 4B) screens text at two points, both fail-open. First the raw idea, before a single MiniCPM call is spent. Then — once the per-page FLUX prompts exist — every one of them, screened concurrently with a small thread pool, before any art is generated. It runs the Aegis-V2 taxonomy and a per-category blocklist that stops real-world harm while letting fictional adventure through, so "a knight with a sword" or "a dragon's fiery brawl" sail past and "how to build a bomb" is stopped. moderation.py is a pure-stdlib, fail-open client; modal_safety.py is the Modal service, which scales to zero and is warmed at startup. The client only needs a URL and a token, so the same deployment backs all three of my hackathon Spaces.
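A fail-open client in that spirit might look like the sketch below. It is not the real moderation.py: the request body, the `{"blocked": bool}` response shape, and the function names are assumptions, and the transport is injected so the example runs without a live endpoint.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_sender(url, token, timeout=8.0):
    """Build a sender that POSTs text to a safety endpoint (assumed schema)."""
    def send(text):
        request = urllib.request.Request(
            url,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    return send


def is_allowed(text, send):
    """Block only on an explicit verdict; any failure lets the text through."""
    try:
        return not bool(send(text).get("blocked", False))
    except Exception:
        return True  # fail open: a sleeping or broken gate must not break the app


def screen_all(texts, send, max_workers=4):
    """Screen every page prompt concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: is_allowed(t, send), texts))
```

In use, `send = http_sender(url, token)` would be built once from the two environment values the client needs, then passed to `is_allowed` for the raw idea and to `screen_all` for the page prompts.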

A community library and a private shelf

Sharing is opt-in: press SHARE, name yourself, and the book — cover, pages, and comic dialogue — saves to a Hugging Face Bucket at /data as one immutable file per record, plus a tiny card for fast gallery listings; even likes are one empty file each. No read-modify-write, no locking, no race conditions. And because an unshared book would otherwise vanish on reload, every book you make is auto-saved to a private "My Books" shelf in localStorage — your creations as cards, on your device, capped and compressed to fit the quota.
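The one-file-per-record idea can be sketched with plain files. The directory names, the card fields, and the function names here are hypothetical; the only parts taken from the description above are the immutable record, the small card, and the empty file per like.

```python
import json
import tempfile
import uuid
from pathlib import Path


def share_book(root, book):
    """Write one immutable record plus a small card; nothing is ever updated."""
    book_id = uuid.uuid4().hex
    for name in ("books", "cards"):
        (root / name).mkdir(parents=True, exist_ok=True)
    # Mode "x" refuses to overwrite, so an existing record can never be clobbered.
    with open(root / "books" / f"{book_id}.json", "x", encoding="utf-8") as f:
        json.dump(book, f)
    card = {"id": book_id, "title": book["title"], "author": book["author"]}
    with open(root / "cards" / f"{book_id}.json", "x", encoding="utf-8") as f:
        json.dump(card, f)
    return book_id


def like_book(root, book_id):
    """A like is one empty, uniquely named file: no counter to read and rewrite."""
    likes_dir = root / "likes" / book_id
    likes_dir.mkdir(parents=True, exist_ok=True)
    (likes_dir / uuid.uuid4().hex).touch()


def count_likes(root, book_id):
    likes_dir = root / "likes" / book_id
    return sum(1 for _ in likes_dir.iterdir()) if likes_dir.is_dir() else 0


def list_cards(root):
    """Gallery listing reads only the small cards, never the full books."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted((root / "cards").glob("*.json"))
    ]


with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp)
    new_id = share_book(library, {"title": "The Tiny Tea Shop", "author": "Sam", "pages": []})
    like_book(library, new_id)
    like_book(library, new_id)
    print(count_likes(library, new_id), list_cards(library)[0]["title"])
```

Two people liking the same book at the same moment each create a different file, so there is no shared counter for them to race on.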

What I learned

  • Decompose to beat the token wall. One plan call + parallel per-page calls never truncates, gives reliable panel counts, and — fed the full outline — reads as one continuous story instead of N disjoint scenes.
  • If the model can't spell, don't make it. klein paints gorgeously and letters badly; drawing the speech bubbles in the app and baking them on once made comic text perfect in the reader and the PDF alike.
  • Bake once, render everywhere. Compositing text onto the image a single time killed a whole class of "the reader looks different from the PDF" bugs.
  • Consistency is words, not weights. A fixed cast look restated per page, plus an optional MiniCPM-V caption of an uploaded drawing, keeps a character on-model even when there is no reference image to condition on.
  • A safety gate must never break the app. Two fail-open Nemotron passes — the idea, then every page prompt — off-Space on Modal, scale-to-zero, tunable by env.
  • ZeroGPU only grants GPU time to real Gradio events. A plain fetch() (or a startup background thread) gets none, so route GPU work through a postMessage bridge, both for generation and for seeding.

Go make one

It's all in one Space: MiniCPM writing, FLUX painting, Nemotron guarding, a bucket remembering. Type a sentence, watch it become a book you can actually read and hold — then share it for the next person to find.

Now make yours.
