Professor Pip - a tiny teacher who lives in your browser

Community Article
Published June 15, 2026

There's a strange empty spot in the AI demo landscape. Everyone's shipping chatbots in text boxes. A few brave souls do voice. But a 3D character — one that actually looks at a kid, raises an eyebrow when they get something right, and teaches a five-minute lesson — somehow almost nobody builds, because the obvious way to do it is expensive: stream a GPU-rendered face, burn money on every frame. So we didn't do it the obvious way.

The moment that made the whole project worth it: a five-year-old asked "why is the sky blue?" out loud, mid-lesson, and Professor Pip stopped, answered her in one warm sentence, did a little "good question" gesture, and went back to the lesson. She didn't know there were four models and two clouds involved. She just thought the cartoon teacher was listening. That's the bar.

What Professor Pip is


Watch the YouTube Demo Video

Professor Pip is a 3D talking-avatar teacher for kids aged 5 to 10. There are ten premade five-minute voice courses spanning science, math, nature, and a story or two (sky-blue, photosynthesis, odd and even, the water cycle, day and night, butterflies, bees, the five senses, shapes, magnets), plus a "make your own lesson" box and spoken, raise-your-hand Q&A. Captions are on by default. Get something right and you earn stickers, stars, and confetti. It was built for the Build Small hackathon, in the Backyard AI track: practical tools for real people, in this case a "personal study tutor" or "storybook for a child" that runs on a tiny model you own.

The architecture (or: where the GPU isn't)

The headline trick is that the 3D avatar renders in the browser at ~60fps with no cloud GPU touching the face. We used met4citizen's TalkingHead (MIT) on top of Three.js, so the WebGL character (fifteen visemes, mood expressions, body gestures) runs entirely on the kid's laptop. The cloud never renders a pixel of Pip.

The lesson itself is a deterministic state machine in the browser. There's no agent loop chewing tokens for five minutes; the frontend just walks the course, and premade segments are spoken verbatim — authored in Pip's voice and sent to TTS as-is. That keeps the whole thing cheap and predictable.
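
Here's a minimal sketch of what a state machine like that could look like. It's written in Python for readability even though the real one lives in the browser, and the names (`LessonMachine`, `Phase`, the event strings) are hypothetical. It also assumes that after a raise-hand answer Pip resumes the segment that was interrupted.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    SPEAKING = auto()   # Pip is voicing a premade segment
    ANSWERING = auto()  # lesson paused for a raise-hand question
    DONE = auto()       # every segment has been spoken


@dataclass
class LessonMachine:
    segments: list[str]           # premade lines, spoken verbatim
    index: int = 0
    phase: Phase = Phase.SPEAKING

    def __post_init__(self) -> None:
        # An empty course has nothing to speak.
        if not self.segments:
            self.phase = Phase.DONE

    def current_line(self) -> str | None:
        """Line to send to TTS right now, or None if paused/finished."""
        if self.phase is Phase.SPEAKING:
            return self.segments[self.index]
        return None

    def on_event(self, event: str) -> Phase:
        """Advance on a UI event; unknown or ill-timed events are ignored."""
        if self.phase is Phase.SPEAKING and event == "hand_raised":
            self.phase = Phase.ANSWERING
        elif self.phase is Phase.ANSWERING and event == "answer_done":
            self.phase = Phase.SPEAKING  # resume the interrupted segment
        elif self.phase is Phase.SPEAKING and event == "segment_done":
            self.index += 1
            if self.index >= len(self.segments):
                self.phase = Phase.DONE
        return self.phase
```

The point of the shape: every transition is a plain lookup on (phase, event), so walking a five-minute course costs no model calls at all.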

Behind it sits a Gradio Space on Hugging Face orchestrating four stateless endpoints — /asr, /brain, /speak, /make_course — with heavy compute on one scale-to-zero Modal GPU container (Kokoro-82M for speech, faster-whisper small for listening, and the fine-tuned brain). Scale-to-zero matters: a kids' learning toy that's idle most of the day shouldn't cost anything when nobody's using it.
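
To make the data flow concrete, here's a rough sketch of how one raise-hand question might pass through three of those endpoints. The function name, the injected callables, and the redirect's mood and gesture values are all assumptions for illustration. The real Space's wiring isn't shown in this post.

```python
from typing import Callable

REDIRECT_TEXT = "Let's pick something we can learn about!"


def answer_raised_hand(
    audio: bytes,
    asr: Callable[[bytes], str],      # stands in for /asr
    brain: Callable[[str], dict],     # stands in for /brain
    speak: Callable[[str], bytes],    # stands in for /speak
    is_safe: Callable[[str], bool],   # the deterministic safety check
) -> dict:
    """One stateless round trip: audio in, reply JSON plus audio out."""

    def redirect() -> dict:
        # Hypothetical mood/gesture values for the friendly redirect.
        return {"text": REDIRECT_TEXT, "mood": "happy", "gesture": "none"}

    question = asr(audio)
    # Check the child's input before the model ever sees it.
    reply = brain(question) if is_safe(question) else redirect()
    # Check the model's line again before it is voiced.
    if not is_safe(reply["text"]):
        reply = redirect()
    return {**reply, "audio": speak(reply["text"])}
```

Because nothing is kept between calls, each step can be retried or swapped out without touching the others.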

Safety is deterministic, server-side, and non-bypassable. A curated denylist plus leetspeak normalization checks every child input and every line before it's ever voiced. Anything unsafe becomes a friendly redirect: "let's pick something we can learn about!" No child audio or PII is ever persisted. We didn't trust the model to be the safety layer; the model is the backstop, the code is the gate.
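
A gate of that kind can be sketched in a few lines. The denylist entries and the leetspeak table below are placeholders, not the curated production lists, and the function names are hypothetical.

```python
import re

REDIRECT_TEXT = "Let's pick something we can learn about!"

# Placeholder entries; a real denylist would be curated and much longer.
DENYLIST = {"weapon", "poison"}

# Assumed leetspeak substitutions, undone before matching.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    """Lowercase and undo common character substitutions."""
    return text.lower().translate(LEET)


def is_safe(text: str) -> bool:
    """True when no normalized word appears on the denylist."""
    words = re.findall(r"[a-z]+", normalize(text))
    return not any(word in DENYLIST for word in words)


def gate(text: str) -> str:
    """Pass safe text through unchanged; swap anything else for the redirect."""
    return text if is_safe(text) else REDIRECT_TEXT
```

The same `gate` would run twice per turn, once on what the child said and once on what Pip is about to say.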

The fine-tune: voice and contract, not knowledge

Here's the part I'd defend in a code review. We fine-tuned a tiny model — but not to make it smarter. A 1B model is never going to out-know the internet, and trying to cram facts into a LoRA is a fool's errand. We fine-tuned it for two things only: voice and contract.

  • Base: openbmb/MiniCPM5-1B-SFT — a standard LlamaForCausalLM, 1.08B params, Apache-2.0.
  • The LoRA: rank 32, alpha 64, dropout 0.05, targeting the attention and MLP linears; 3 epochs, lr 2e-4 cosine, bf16, with assistant-only loss masking. 22.4M trainable params — about 2% of the model.
  • The data: ~2,016 synthetic in-voice examples, generated and then strictly validated through the same safety gate that runs in production. Balanced roughly 30% raise-hand answers, 30% lesson delivery, 25% encouragement, 15% safe redirects.
  • Training: on Modal (one A10) in about 12 minutes. Final train loss 1.30.
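
For readers who haven't met assistant-only loss masking: the idea is that only the tokens Pip would say contribute to the loss, while the system and user tokens are ignored. A minimal sketch, assuming labels are masked with -100 (the default `ignore_index` of PyTorch's cross-entropy loss); `mask_labels` is a hypothetical helper, not the training code used here.

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips this label by default


def mask_labels(token_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Keep labels for assistant tokens and ignore everything else."""
    if len(token_ids) != len(assistant_mask):
        raise ValueError("token_ids and assistant_mask must be the same length")
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]


# Sanity check on the figures above: 22.4M trainable out of 1.08B total.
trainable_fraction = 22.4e6 / 1.08e9  # about 2% of the model
```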

The contract is the whole point. Every reply is exactly one JSON object — {"text", "mood", "gesture"} — where mood drives the avatar's face and gesture drives its body. MiniCPM5's ChatML template prefills an empty <think></think> (no-think mode), so there's no reasoning trace leaking out — just the one kid-facing line.
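
A contract like this is easy to enforce in code. Below is a minimal validator sketch. The three keys come from the contract above, but the allowed mood and gesture values are hypothetical stand-ins, since the real enums aren't listed in this post.

```python
from __future__ import annotations

import json

REQUIRED_KEYS = {"text", "mood", "gesture"}
ALLOWED_MOODS = {"neutral", "happy", "curious", "proud"}    # hypothetical
ALLOWED_GESTURES = {"none", "wave", "thumbs_up", "think"}   # hypothetical


def parse_reply(raw: str) -> dict | None:
    """Return the reply dict if it satisfies the contract, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Exactly one JSON object with exactly the three expected keys.
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return None
    # All three values must be strings, and the spoken line must be non-empty.
    if not all(isinstance(obj[key], str) for key in REQUIRED_KEYS):
        return None
    if not obj["text"].strip():
        return None
    if obj["mood"] not in ALLOWED_MOODS or obj["gesture"] not in ALLOWED_GESTURES:
        return None
    return obj
```

Anything that returns `None` can be replaced by a canned line, so the avatar never receives a mood or gesture it doesn't know how to play.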

The held-out contract eval (150 gold examples):

  • 100% valid JSON
  • 100% valid mood/gesture enums
  • 99.3% safe
  • 99.3% fully contract-correct
  • average reply ~142 characters
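
A quick arithmetic check on those percentages: with 150 examples, 99.3% is what you get when exactly one example fails. That reading is an inference from the numbers, not something logged here, and `pct` is just a helper for the sketch.

```python
def pct(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


TOTAL = 150  # held-out gold examples

all_pass = pct(150, TOTAL)   # matches the two 100% rows
one_miss = pct(149, TOTAL)   # matches the two 99.3% rows
two_miss = pct(148, TOTAL)   # would have been reported as 98.7
```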

Then we merged the LoRA, converted to GGUF, and quantized to Q4_K_M (688 MB) and Q8_0 (1.15 GB). We serve it with llama.cpp (via llama-cpp-python) on Modal — on CPU, since the A10 is busy with Kokoro — and verified it on the live endpoint. One nice detail from the trenches: MiniCPM5 ends turns with <|im_end|>, not the EOS token, so generation has to stop on that string or it'll keep babbling. Little things like that are 90% of "deployed and working."
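
The stop-string detail looks roughly like this with llama-cpp-python. `create_completion` and its `stop` and `max_tokens` arguments are part of that library's `Llama` class; the wrapper name, the token budget, and the extra defensive trim are assumptions for this sketch.

```python
STOP = "<|im_end|>"  # MiniCPM5's end-of-turn marker


def generate_reply(llm, prompt: str, max_tokens: int = 128) -> str:
    """Generate one turn, halting at the end-of-turn marker.

    `llm` is expected to behave like llama_cpp.Llama: create_completion()
    returns a dict whose text lives at ["choices"][0]["text"].
    """
    out = llm.create_completion(prompt, max_tokens=max_tokens, stop=[STOP])
    text = out["choices"][0]["text"]
    # Defensive trim in case the marker still slips into the output.
    return text.split(STOP, 1)[0].strip()
```

In use, `llm` would be something like `Llama(model_path=...)` pointed at the Q4_K_M file, with the prompt built from the model's chat template.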

The honest lesson

The narrow LoRA did exactly what we asked — and that turned out to be a double-edged sword. It made the model excellent at the short live-voice contract: raise-hand answers come back clean, in-character, on-format, basically every time. But it degraded long-form course authoring. Ask the fine-tuned model to write a whole new five-minute lesson and the very thing that made it sharp on one-liners made it worse at structure.

So "make your own lesson" uses a deterministic template fallback instead of the LoRA. And honestly? That's the lesson worth writing down. Fine-tuning isn't free — you're trading generality for a sharp edge, and you have to choose where to put the edge. We put it on voice and contract, where determinism would be brittle and the model earns its keep. We kept deterministic code for everything that's actually a structure problem, not a voice problem. Knowing which is which — that's the skill.
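
For a sense of what a deterministic template fallback can mean in practice, here's a deliberately tiny sketch. The template lines and the `make_course` signature are invented for illustration and are not the actual lesson template.

```python
# Hypothetical lesson skeleton; each line becomes one spoken segment.
TEMPLATE = [
    "Hi friends! Today we are going to learn about {topic}.",
    "First, what do you already know about {topic}?",
    "Here is something fun to notice about {topic}.",
    "Great job! Now you know a little more about {topic}.",
]


def make_course(topic: str) -> list[str]:
    """Fill the fixed skeleton with the topic: same input, same lesson."""
    clean = " ".join(topic.split())  # collapse stray whitespace
    if not clean:
        raise ValueError("topic must not be empty")
    return [line.format(topic=clean) for line in TEMPLATE]
```

The structure is guaranteed by construction, which is exactly the property the fine-tuned model lost on long-form output.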

Links

Built on TalkingHead + Three.js, Kokoro-82M, faster-whisper, MiniCPM5-1B, Modal, and llama.cpp. Go ask Pip why the sky is blue.
