Intelligence without self-reflection is just pattern matching. We've been running Ollama for inference and ChromaDB for RAG search, but the system had no way to evaluate its own answers, generate training data, or improve over time. This session laid the full foundation for a self-learning loop: LLMs grade each other's responses, high-quality code chunks get distilled into training pairs, and the local NVIDIA GPU opens the door to local fine-tuning. Six deliverables across evaluation, training, automation, and research.
Multi-Tier Evaluation System
Built a three-tier response evaluation engine that fires asynchronously after every chat response. Tier 1 uses the same local Ollama model to self-assess accuracy, relevance, completeness, and clarity on a 1–10 scale. Tier 2 sends the same prompt to a different local model for cross-validation — if Llama 3.2 generated the answer, Phi-3 grades it. Tier 3 is the supervisor tier: opt-in calls to Claude or OpenAI's API for ground-truth evaluation, either on-demand or via random sampling. All scores log to JSONL with trend analysis and markdown reporting. Three new API endpoints expose system status, historical reports, and manual evaluation triggers.
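For a flavor of the Tier 2 cross-validation call, here's a minimal sketch against Ollama's local REST API. The prompt wording, score parsing, and helper name are illustrative, not the actual module:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's local REST endpoint

GRADING_PROMPT = """Rate the following answer on a 1-10 scale for each of:
accuracy, relevance, completeness, clarity. Reply with JSON only, e.g.
{{"accuracy": 8, "relevance": 9, "completeness": 7, "clarity": 8}}.

Question: {question}

Answer: {answer}"""

def cross_grade(question: str, answer: str, grader_model: str = "phi3") -> dict:
    """Tier 2: ask a *different* local model to grade the response."""
    resp = requests.post(OLLAMA_URL, json={
        "model": grader_model,
        "messages": [{"role": "user",
                      "content": GRADING_PROMPT.format(question=question, answer=answer)}],
        "stream": False,   # one complete message instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }, timeout=120)
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])
```

Writing each returned score dict as one line of JSONL is what keeps the trend analysis and markdown reporting cheap.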
Knowledge Distillation Pipeline
The training module pulls code chunks from any RAG collection, sends them to a teacher model (Claude or OpenAI), and generates 3 Q&A training pairs per chunk in OpenAI conversation JSONL format. The pipeline is resumable (tracks processed chunk IDs), provides cost estimation before execution, and outputs data ready for fine-tuning. With ~6,000 chunks across all collections, we can generate ~18,000 training pairs — well above the 5K–10K sweet spot for domain adaptation.
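As a sketch of the output contract (file paths and helper names here are hypothetical, not the pipeline's real ones), each chunk becomes three lines of OpenAI conversation-format JSONL, and a processed-IDs file is what makes the run resumable:

```python
import json
from pathlib import Path

DONE_FILE = Path("data/training/processed_ids.txt")  # hypothetical path
OUT_FILE = Path("data/training/pairs.jsonl")

def already_done() -> set[str]:
    return set(DONE_FILE.read_text().splitlines()) if DONE_FILE.exists() else set()

def distill(chunks, teacher):
    """chunks: iterable of (chunk_id, code_text); teacher: callable returning
    a list of (question, answer) pairs, e.g. a wrapped Claude/OpenAI call."""
    done = already_done()
    with OUT_FILE.open("a") as out, DONE_FILE.open("a") as log:
        for chunk_id, code in chunks:
            if chunk_id in done:
                continue  # resumability: skip chunks from a previous run
            for question, answer in teacher(code):  # expected to yield 3 pairs
                out.write(json.dumps({"messages": [
                    {"role": "system", "content": "You are a coding assistant."},
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]}) + "\n")
            log.write(chunk_id + "\n")  # mark processed only after pairs are written
```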
Embedding Model Upgrade
Swapped from all-MiniLM-L6-v2 (384 dimensions, 512 token context) to nomic-embed-text-v1.5 (768 dimensions, 8,192 token context). The 16x larger context window is the real win for code — the old model was truncating most function bodies at 512 tokens. The model is configurable via environment variable with the old model as the default fallback, so rollback is one env var deletion away.
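The fallback wiring is roughly this shape (variable names assumed):

```python
import os
from sentence_transformers import SentenceTransformer

# Old model stays the default, so rollback = delete the env var.
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")

# nomic-embed-text-v1.5 ships custom modeling code, hence trust_remote_code;
# note it also expects task prefixes ("search_document: " / "search_query: ").
model = SentenceTransformer(EMBEDDING_MODEL, trust_remote_code=True)
```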
Goose Autonomous Tasks
Set up three headless Goose tasks that run on Windows Task Scheduler: a nightly code review across all repos (daily 3 AM), a test coverage audit (weekly Sunday 4 AM), and a documentation sync check (weekly Wednesday 4 AM). All tasks are strictly read-only — they can only write reports to data/reports/, cannot modify source files, cannot run package installs, and cannot make git commits. The .goosehints file enforces safety constraints at the project level.
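The exact wording is project-specific, but the safety section of the .goosehints file reads roughly like this (illustrative excerpt, not verbatim):

```text
# Safety constraints for scheduled headless runs
- You are running unattended. Operate strictly read-only.
- Write output ONLY to data/reports/ (markdown reports).
- Never modify source files, configs, or tests.
- Never run package managers (pip, npm) or installers.
- Never run git commands that change state (commit, push, checkout).
```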
Jenkins RAG Automation
Added a git post-commit hook that triggers Jenkins RAG re-indexing immediately when Python, JavaScript, or TypeScript files change — no more waiting up to 12 hours for the next cron cycle. Also added a Saturday morning full-index job for SlopShop and Portfolio collections that were previously manual-only. A security review tightened the hook to use Jenkins API token auth instead of anonymous triggers.
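A post-commit hook in this style (the Jenkins URL, job name, and env var names are illustrative) filters on changed file extensions and hits Jenkins' standard remote-build endpoint with user + API token auth:

```python
#!/usr/bin/env python3
"""Hypothetical .git/hooks/post-commit: trigger RAG re-index on code changes."""
import os
import subprocess
import requests

# Only fire when Python/JS/TS files were part of the commit.
changed = subprocess.run(
    ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

if any(f.endswith((".py", ".js", ".ts")) for f in changed):
    requests.post(
        "http://jenkins.local:8080/job/rag-reindex/build",  # illustrative URL/job
        auth=(os.environ["JENKINS_USER"], os.environ["JENKINS_API_TOKEN"]),
        timeout=10,
    ).raise_for_status()
```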
Phase 2: Fine-Tuning Research
The local NVIDIA GPU unlocks QLoRA fine-tuning of 7–8B parameter models locally. A dedicated research agent evaluated frameworks, models, and deployment pipelines. The verdict: Unsloth for training (2–5x faster, 80% less VRAM, built-in GGUF export), Qwen2.5-Coder-7B-Instruct as the base model (61.6% HumanEval, outperforms models 3–4x larger), and GRPO for self-improving reasoning once the base fine-tune is solid. Full training run estimated at 1–3 hours on local hardware. The 900-line research doc covers CUDA 12.8 compatibility, WSL2 requirements, deployment through Ollama, and cost analysis.
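For a flavor of what the plan recommends, a minimal Unsloth QLoRA setup looks roughly like this; the hyperparameters are illustrative placeholders, not the plan's final values:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit (QLoRA keeps base weights frozen and quantized).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach trainable LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... train with trl's SFTTrainer on the distilled JSONL pairs ...

# Built-in GGUF export, ready to serve through Ollama.
model.save_pretrained_gguf("jefeai-coder", tokenizer, quantization_method="q4_k_m")
```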
Numbers
| Metric | Value |
|---|---|
| New Python modules | 8 files: evaluation (4), training (4) |
| Modified files | 4: API server, RAG store, config, ComfyUI generator |
| New automation | 3 Goose tasks, 1 git hook, 1 scheduler script |
| Research output | 900-line Phase 2 fine-tuning plan |
| Total lines added | ~2,500 across 18 files |
| Security fixes applied | 7 (from pre-commit security review) |
What's Next
- Full RAG re-index with the new embedding model (all collections, ~1–2 hours)
- Manual Goose task test run to validate read-only behavior
- Knowledge distillation pilot: 100 chunks through Claude to generate initial training data
- WSL2 + Unsloth environment setup for first QLoRA fine-tuning run
- Deploy custom jefeai-coder model to Ollama once training data is sufficient