This one lit up every neuron in the pipeline. We built an end-to-end system that takes a raw D&D session transcript from Discord, runs it through a multi-stage AI pipeline (scene extraction, illustration, narration, indexing), and delivers the results back to Discord, to a dashboard, and to a public showcase page. Three repos touched, ~3,500 lines of new code across JefeAI, the DnD Bot, and jefehz.org. See the live showcase at jefehz.org/gygaxbot.
GygaxBot: The Discord Bot
GygaxBot is the DnD Bot's archival brain. When a DM runs !session archive, the bot captures the full channel transcript and POSTs it to the JefeAI API's /dnd/session/archive endpoint. The bot receives a job ID, polls for completion, and posts scene illustrations with narration text directly into Discord as rich embeds. It also supports --backend flags to choose between Gemini (cloud, supports reference photos) and Flux.1 Dev (local GPU via ComfyUI) for image generation. Campaign data, character sheets, NPC indexes, and session directories all live under the bot's campaigns/ tree, which the JefeAI API reads directly for reference images and metadata.
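The bot side of that flow is a submit-then-poll loop. A minimal Python sketch of the client logic (the bot itself is JavaScript, and only `/dnd/session/archive` is a confirmed endpoint; the job-status path, payload fields, and injected `post`/`get` callables here are illustrative assumptions):

```python
import time

def archive_session(post, transcript: str, campaign: str, backend: str = "gemini") -> str:
    """POST the transcript to the archive endpoint and return the job ID.
    `post` is injected (e.g. a thin wrapper over requests.post) so the
    flow can be exercised without a live API."""
    resp = post("/dnd/session/archive", json={
        "campaign": campaign,
        "transcript": transcript,
        "image_backend": backend,   # "flux", "gemini", or "both"
    })
    return resp["job_id"]

def wait_for_job(get, job_id: str, interval: float = 2.0, max_polls: int = 200) -> dict:
    """Poll a (hypothetical) job-status endpoint until the pipeline finishes."""
    for _ in range(max_polls):
        status = get(f"/dnd/session/jobs/{job_id}")
        if status["state"] in ("complete", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

On completion, the bot walks the returned scene list and posts each illustration plus narration text as a rich embed.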
The Archival Pipeline
The JefeAI DnD router (dnd_router.py — ~1,350 lines) orchestrates a five-stage async pipeline. Stage 1: the transcript hits a local Llama 3.1 8B via Ollama, which, guided by a carefully crafted extraction prompt, extracts 4–6 key scenes with titles, visual descriptions, narration scripts, character lists, and locations. Stage 2: Fish Speech voice cloning (or Kokoro TTS as fallback) generates dramatic audio narration for each scene. Stage 3: scene descriptions are enhanced with campaign style tags and sent to the Gemini image API with up to 12 character reference photos for visual consistency — or to Flux.1 Dev locally, or both in parallel. Stage 4: ambient audio generation is stubbed (Stable Audio Open quality wasn't sufficient, but the pipeline slot is ready). Stage 5: ChromaDB indexes the transcript, summary, and scene narrations into a dnd-campaigns collection for semantic search across all campaign history.
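The stage ordering can be sketched with asyncio. All function names below are illustrative stand-ins, not the actual dnd_router.py internals; the one structural point it shows is that narration and illustration for a given scene are independent and can run concurrently:

```python
import asyncio

async def extract_scenes(transcript: str) -> list[dict]:
    # Stage 1: Llama 3.1 8B via Ollama pulls 4-6 key scenes (stubbed here).
    return [{"title": "Opening", "description": "...", "narration": "..."}]

async def narrate(scene: dict) -> str:
    # Stage 2: Fish Speech clone (or Kokoro fallback) renders narration audio.
    return f"{scene['title']}.wav"

async def illustrate(scene: dict, backend: str) -> list[str]:
    # Stage 3: Gemini with reference photos, local Flux.1 Dev, or both.
    backends = ["flux", "gemini"] if backend == "both" else [backend]
    return [f"{scene['title']}.{b}.png" for b in backends]

async def index_session(transcript: str, scenes: list[dict]) -> None:
    # Stage 5: ChromaDB indexing for semantic search.
    # (Stage 4, ambient audio, is stubbed in the real pipeline too.)
    pass

async def run_pipeline(transcript: str, backend: str = "gemini") -> list[dict]:
    scenes = await extract_scenes(transcript)
    for scene in scenes:
        # Narration and illustration don't depend on each other: run both.
        audio, images = await asyncio.gather(narrate(scene), illustrate(scene, backend))
        scene["audio"], scene["images"] = audio, images
    await index_session(transcript, scenes)
    return scenes
```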
Dual Image Backends
We started with Flux.1 Dev running locally through ComfyUI — good quality but no reference image support, so characters looked different in every scene. Adding Gemini's image generation API solved this: the system automatically discovers character reference images on disk (reference/characters/{name}.png) and feeds them to Gemini alongside text descriptions. A single character can have multiple reference photos (portrait, action pose, with gear). The image_backend parameter accepts "flux", "gemini", or "both" — "both" generates two versions of each scene for comparison. ComfyUI also got an overhaul to support the Flux.1 Dev model specifically, with proper UNET loading and 4-bit quantization for the 12B parameter model.
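The reference-discovery step might look like the sketch below. The `reference/characters/{name}.png` layout and the 12-photo cap come from the post; the prefix-glob convention for additional poses (e.g. `{name}_action.png`) is a guess at how multiple photos per character are named:

```python
from pathlib import Path

def discover_references(campaign_dir: str, characters: list[str], limit: int = 12) -> list[Path]:
    """Collect reference photos for a scene's characters, capped at the
    12-image limit the pipeline sends to Gemini alongside the prompt."""
    ref_dir = Path(campaign_dir) / "reference" / "characters"
    refs: list[Path] = []
    for name in characters:
        # Primary portrait plus any extra poses sharing the name prefix.
        refs.extend(sorted(ref_dir.glob(f"{name}*.png")))
    return refs[:limit]
```

With `image_backend="both"`, the same scene prompt is dispatched to this Gemini path and to the local Flux.1 Dev path, producing two candidate images per scene.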
TTS: Kokoro and Voice Cloning
Two TTS backends were built. kokoro_service.py wraps Kokoro TTS with campaign-specific voice profiles (each campaign gets a default narrator voice). fish_speech_service.py implements voice cloning via Fish Speech — drop a narrator.wav reference file in the campaign directory and all narrations use that cloned voice. Both services run on CPU (they share the machine with the GPU-hungry image generators), output 24kHz WAV files, and handle per-scene file naming. The pipeline tries voice cloning first and falls back to Kokoro if no reference audio exists.
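The clone-first routing is simple: check for the reference file, then dispatch. A minimal sketch with the two engines injected as callables so the decision is testable offline (the output filename pattern is illustrative; the narrator.wav convention is from the post):

```python
from pathlib import Path

def synthesize_narration(campaign_dir: str, scene_idx: int, text: str,
                         clone_tts, kokoro_tts) -> str:
    """Prefer Fish Speech voice cloning when a narrator.wav reference exists
    in the campaign directory; otherwise fall back to Kokoro TTS. The real
    engines live in fish_speech_service.py and kokoro_service.py."""
    out = f"scene_{scene_idx:02d}_narration.wav"  # per-scene naming (illustrative)
    reference = Path(campaign_dir) / "narrator.wav"
    if reference.exists():
        clone_tts(text, str(reference), out)      # cloned narrator voice
    else:
        kokoro_tts(text, out)                     # campaign default voice
    return out
```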
Dashboard: Gygaxbot Tab
Added a "Gygaxbot" tab to the JefeAI dashboard at localhost:8000/dashboard. Four new read-only backend endpoints serve campaign lists, session detail (scenes + transcript + media file lists), image/audio files via FileResponse, and character reference photos — all with path traversal validation. The frontend provides a campaign selector, character cards with reference image thumbnails, expandable session cards with scene galleries, inline audio players, image lightbox, transcript viewer, and media regeneration with job status polling. RAG search across sessions is wired through the existing /dnd/session/search endpoint.
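The path traversal validation on the media endpoints boils down to resolving the requested path and confirming it stays inside the campaigns tree before handing it to FileResponse. A minimal sketch (the helper name and exact check are assumptions, not the dashboard's actual code):

```python
from pathlib import Path

CAMPAIGNS_ROOT = Path("campaigns").resolve()  # the bot's campaigns/ tree

def safe_media_path(campaign: str, relative: str) -> Path:
    """Resolve a requested media file and reject anything that escapes the
    campaigns tree, e.g. '../../etc/passwd' smuggled into the URL."""
    candidate = (CAMPAIGNS_ROOT / campaign / relative).resolve()
    if not candidate.is_relative_to(CAMPAIGNS_ROOT):  # Python 3.9+
        raise ValueError(f"path escapes campaigns tree: {relative}")
    return candidate
```

In the endpoint, a ValueError here would map to a 400/404 rather than ever opening the file.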
Showcase: jefehz.org/gygaxbot
Built a public showcase page presenting Session 1 of the Spitwater campaign ("Democracy via Large Artillery"). The page walks through the five pipeline stages with technology tags, shows the party's three characters with their reference photos, then presents all six extracted scenes as a gallery with AI-generated illustrations and embedded audio narration. A technology deep-dive explains each component: Ollama for scene extraction, Gemini with reference images for illustration, Kokoro for narration, ChromaDB for RAG, and FastAPI for orchestration. The media assets were copied to the jefehz.org static directory for standalone hosting independent of the JefeAI API.
Numbers
| Metric | Value |
|---|---|
| New Python (JefeAI) | ~3,300 lines across dnd_router, tts services, comfyui, gemini, rag retriever |
| New JS (DnD Bot) | archiveService, session commands, callback handling |
| New JS/HTML (Dashboard) | ~400 lines: api methods, gygaxbot tab, CSS |
| New HTML/CSS (Showcase) | ~550 lines: standalone page + styles |
| Pipeline time (Session 1) | <7 minutes end-to-end |
| Scenes extracted | 6 (from ~8,000 word transcript) |
| Images generated | 6 (Gemini with character references) |
| Audio narrations | 6 WAV files via Fish Speech / Kokoro TTS |
What's Next
- Multi-session showcase navigation as more sessions get archived
- Automated showcase page generation from the archival pipeline
- Ambient audio per scene (waiting for better models)
- Campaign comparison view across Spitwater and Goldengloom
- Discord embed improvements with scene carousel navigation