Brain Reps: Building a Reproductive Endocrinology Research Tutor
A graph-augmented spaced-repetition research agent built solo in 10 days — Neo4j, Qdrant, Cloudflare Workers, and Ollama, at $0 cloud cost. From monolith to multi-channel production.
100% quiz delivery rate across 3 channels (Slack, Email, Teams). 20%+ knowledge retention improvement via FSRS scheduling. Concept stabilisation within 3 review cycles. 200+ cross-domain intersections mapped. $0 cloud infrastructure cost on Cloudflare free tier.
The Problem Worth Solving
Reading research papers exposes you to novel discoveries and builds on your existing mental schema. However, reading a paper does not by itself constitute learning.
Think of it this way: you cannot fill a bucket by walking past a river. Proximity to water is not the same as contact with it, and contact is not the same as drinking. The knowledge worker who reads a hundred papers and retains the ideas from perhaps ten of them has walked past the river ninety times. The water was real, but the engagement was missing.
The actual learning, the kind that sticks and can turn into a skill, happens through retrieval. Being forced to reconstruct an idea from memory, discovering exactly where the gap is, and re-encoding with that gap closed: this is how understanding develops. Anyone working across multiple disciplines, such as medicine, philosophy, mathematics, computational psychology and more, knows this problem acutely, because the connections between fields are often where the real insight lives, and those connections are exactly what a passive reading habit destroys. You read neuroscience on Monday, endocrinology on Thursday, and by the following week, neither paper is speaking to the other, even though they were always in conversation.
I developed Brain Reps to close that gap. Not as a study app, but as a tutor that reads leading research papers alongside you, maps the concepts into a knowledge graph, and surfaces questions, particularly cross-domain questions, at the exact moment you are most likely to forget.
The goal was to develop the simplest system that actually worked for one person, then scale it to work for many. That constraint shaped everything that followed.
Day One: A Working Set of Local Agents
The initial commit landed on April 21 at 6:44 PM. By that evening, a Python FastAPI process running locally could ingest a PDF through Docling, chunk and embed it into Qdrant, extract concepts and relationships into Neo4j, generate a quiz question via Ollama, and post it to a Slack DM.
Local large language models have come a long way since their earliest releases. The Gemma family has proved memory-efficient, with impressive inference performance. Still, one skill worth developing is budgeting memory across the LLM, the embedding model, the in-memory vector store, regular apps, and the OS itself. With 24 GB of RAM, I assumed that gemma4:26b (17 GB in size) would run comfortably with 7 GB of headroom. As you will see later, it did not: only a smaller variant stayed below the ceiling at which the OS starts swapping memory out to the SSD.
It was a monolith where everything lived in one process on one OS. The scheduler was launchd plists, and the "admin dashboard" was grepping log files. There were no tests.
That night, the bootstrapped system generated its first quiz and delivered it to Slack.

The Python Era: Failing Toward a Real System
What followed over the next three days was a systematic education in what it means to build with hardware, AI models, and external APIs rather than against them. Each failure produced a constraint. Each constraint shaped the final architecture.
The Memory Wall
The first ingestion of a large document timed out with HTTP 524 errors. The diagnosis was hardware, not code: gemma4:26b requires approximately 18 GB of model weights. Loaded alongside Qdrant, Neo4j, and Langfuse on a 24 GB M2, the OS began paging the model in and out mid-inference, killing HTTP connections. vm_stat made it visible — page-outs spiked exactly during the knowledge graph extraction step.
The fix was simple: swap gemma4:26b for gemma4:e4b, a 4-bit quantised variant at ~9 GB. Ingestion completed reliably. The lesson was more durable: on unified-memory machines, total model weight must leave room for the full service stack. A smaller model that fits is more capable than a larger model that thrashes.
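The budgeting arithmetic behind that lesson can be made explicit. The sketch below is illustrative only: the two model sizes come from the text, while the per-service figures for Qdrant, Neo4j, Langfuse, and the OS are placeholder assumptions, not measurements from this machine.

```python
# Rough unified-memory budget check.
# Model sizes are taken from the text; every other figure is an
# illustrative assumption, not a measurement.
TOTAL_RAM_GB = 24.0

# Hypothetical resident-memory estimates for the rest of the stack.
ASSUMED_SERVICES_GB = {
    "qdrant": 1.5,
    "neo4j": 2.0,
    "langfuse": 1.0,
    "os_and_apps": 5.0,
}


def headroom_gb(model_gb: float) -> float:
    """Return the RAM left after loading the model and the service stack."""
    return TOTAL_RAM_GB - model_gb - sum(ASSUMED_SERVICES_GB.values())


def fits(model_gb: float) -> bool:
    """A model fits only if the remaining headroom is non-negative."""
    return headroom_gb(model_gb) >= 0


for name, size_gb in [("gemma4:26b", 18.0), ("gemma4:e4b", 9.0)]:
    print(f"{name}: headroom {headroom_gb(size_gb):+.1f} GB, fits={fits(size_gb)}")
```

Under these assumed figures the larger model overshoots physical RAM, which is the condition under which the OS starts paging. The point of the sketch is the habit of summing the whole stack, not the specific numbers.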
The Embedding Schema Migration
Switching the embedding model from nomic-embed-text (768-dim) to snowflake-arctic-embed2 (1024-dim) caused Qdrant to reject all new ingestion attempts — existing collections were schema-bound to 768 dimensions. The resolution was a --rebuild flag that drops and recreates collections. The lesson: any embedding model change is a schema migration. Treat it as one.
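One way to treat an embedding change as a schema migration is to check dimensions before anything is written. The following is a minimal sketch in plain Python; CollectionSchema, check_vectors, and plan_migration are hypothetical names introduced for illustration and are not part of Qdrant's client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionSchema:
    """Hypothetical record of what a vector collection was created with."""
    name: str
    vector_size: int


class DimensionMismatch(ValueError):
    """Raised when a vector's size differs from the collection's schema."""


def check_vectors(schema: CollectionSchema, vectors: list[list[float]]) -> None:
    """Fail fast, before any write, if a vector has the wrong dimension."""
    for index, vector in enumerate(vectors):
        if len(vector) != schema.vector_size:
            raise DimensionMismatch(
                f"{schema.name}: vector {index} has {len(vector)} dims, "
                f"expected {schema.vector_size}"
            )


def plan_migration(schema: CollectionSchema, new_dim: int) -> str:
    """Decide whether a model switch needs a drop-and-recreate."""
    return "noop" if schema.vector_size == new_dim else "rebuild"


papers = CollectionSchema(name="papers", vector_size=768)
print(plan_migration(papers, 1024))  # a 768 -> 1024 switch needs a rebuild
```

A guard like this turns a rejected ingestion into an explicit decision point: either keep the model, or rebuild the collection and re-embed the corpus.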
The Classifier That Saw Only Publishing
The domain classifier was returning "Academia / Journal Publishing" instead of reproductive-endocrinology. A classifier that cannot distinguish the container from the contents is like a librarian who, asked for a book on cardiac surgery, hands you the library catalogue and calls it close enough. The root cause was under-specification: no controlled vocabulary, a single chunk as input, and single-label output on a multi-domain corpus. The fix rewrote classify_domain() to pass three representative chunks, impose a 30-subdomain controlled vocabulary, and return multi-label output (primary_domain, subdomains[], secondary_domains[]) plus a rationale field for debugging. The output went from generic to precise.
The Knowledge Graph That Could Only Pair
The initial Cypher intersection query detected pairwise subdomain overlaps. That is correct for two-field synthesis. It is insufficient for interdisciplinary research, where the most interesting connections involve three to five subdomains simultaneously — the kind of insight that lives not at the intersection of two lines but at the centre of a triangle. The fix built two detection passes: a Cypher aggregation pass for pairwise strength with strength = shared_concepts / sqrt(total_sd1 × total_sd2), followed by a Python itertools.combinations pass for 3–5-way intersections with an arity bonus of strength × (1 + 0.15 × (arity - 2)). Strength thresholds were set at CORE ≥ 0.8, STRONG ≥ 0.5, MOD ≥ 0.2. Cross-domain synthesis questions became possible.
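The scoring described above can be sketched in a few lines of Python. The pairwise formula, the arity bonus, and the thresholds are taken from the text. How the base strength of an N-way intersection is computed is not stated, so this sketch assumes a generalisation of the pairwise formula (shared concepts divided by the geometric mean of the subdomain sizes); that choice and all function names are assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import Optional


def pairwise_strength(shared: int, total_a: int, total_b: int) -> float:
    """strength = shared_concepts / sqrt(total_sd1 * total_sd2)."""
    if total_a == 0 or total_b == 0:
        return 0.0
    return shared / sqrt(total_a * total_b)


def with_arity_bonus(strength: float, arity: int) -> float:
    """Apply the bonus strength * (1 + 0.15 * (arity - 2))."""
    return strength * (1 + 0.15 * (arity - 2))


def label(strength: float) -> Optional[str]:
    """Map a score to the thresholds CORE >= 0.8, STRONG >= 0.5, MOD >= 0.2."""
    if strength >= 0.8:
        return "CORE"
    if strength >= 0.5:
        return "STRONG"
    if strength >= 0.2:
        return "MOD"
    return None  # below every threshold


def n_way_intersections(concepts: dict[str, set[str]],
                        min_arity: int = 3, max_arity: int = 5):
    """Score every 3- to 5-way subdomain combination that shares concepts.

    Assumption: base strength is the shared count over the geometric mean
    of the subdomain sizes, which reduces to the pairwise formula at arity 2.
    """
    names = sorted(concepts)
    results = []
    for arity in range(min_arity, min(max_arity, len(names)) + 1):
        for combo in combinations(names, arity):
            sets = [concepts[name] for name in combo]
            shared = set.intersection(*sets)
            if not shared:
                continue
            product = 1
            for members in sets:
                product *= len(members)
            base = len(shared) / product ** (1 / arity)
            results.append((combo, with_arity_bonus(base, arity)))
    return results


demo = {
    "endocrinology": {"cortisol", "hpa-axis"},
    "neuroscience": {"cortisol", "hpa-axis"},
    "psychology": {"cortisol", "hpa-axis"},
}
for combo, score in n_way_intersections(demo):
    print(combo, round(score, 3), label(score))
```

The demo subdomains and concepts are invented for illustration. Note that the number of combinations grows quickly with the number of subdomains, which is one reason to cap arity at five.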
The Feedback Loop That Never Fired
This was the hardest failure. Quizzes arrived in Slack. Users replied. Nothing happened — no evaluation, no feedback posted back, no traces in Langfuse, no error logs. Five independent bugs, each masking the next.
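Bug 1: trigger_feedback() made a synchronous httpx.post to localhost:8000/runs from inside the async FastAPI handler. The call blocked the event loop while waiting for a response that only that same event loop could produce, so the handler deadlocked on itself until the timeout fired.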
Bug 2: from datetime import datetime lived inside the try block. When the deadlock timeout fired, the except clause referenced an unassigned name.
Bug 3: Thread replies need conversations.replies. conversations.history was being called instead. The API returned 200 OK. all_replies was always empty. Evaluation was never reached.
Bug 4: find_pending_quiz matched the oldest quiz on the DM channel due to non-deterministic glob ordering. Fix: exact thread_ts match first, most-recent fallback second.
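The matching order in the Bug 4 fix can be sketched as follows. The PendingQuiz record and the in-memory list are assumptions made for illustration; the original stored pending quizzes as files discovered by a glob.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PendingQuiz:
    """Hypothetical shape of a pending quiz record."""
    quiz_id: str
    channel: str
    thread_ts: str  # Slack timestamps are strings such as "1713724800.000100"


def _ts_key(ts: str) -> tuple[int, ...]:
    """Order timestamps by (seconds, microseconds) without float rounding."""
    return tuple(int(part) for part in ts.split("."))


def find_pending_quiz(pending: list[PendingQuiz], channel: str,
                      thread_ts: str) -> Optional[PendingQuiz]:
    """Exact thread_ts match first, most recent quiz on the channel second."""
    in_channel = [quiz for quiz in pending if quiz.channel == channel]
    for quiz in in_channel:
        if quiz.thread_ts == thread_ts:
            return quiz
    if not in_channel:
        return None
    return max(in_channel, key=lambda quiz: _ts_key(quiz.thread_ts))


quizzes = [
    PendingQuiz("q1", "D123", "1713724800.000100"),
    PendingQuiz("q2", "D123", "1713811200.000200"),
]
print(find_pending_quiz(quizzes, "D123", "1713724800.000100").quiz_id)
```

Because the result depends only on the timestamps, it no longer varies with the order in which records happen to be listed.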
Bug 5: \' inside a single-quoted f-string expression in answer_evaluator.py caused Python to terminate the string literal early. The file imported without error in the editor. The SyntaxError raised at runtime inside asyncio.create_task landed silently in server.error.log, not server.log — invisible until an explicit log line was added before the empty-replies guard.
The debugging principle that unlocked this: always check both stdout and stderr logs. Adding "I am here" before the guard, seeing "7 answers" in the log but still no feedback, forced attention to what happened after the replies were fetched. Working backward from that known point, the syntax error surfaced.
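A general way to keep failures inside asyncio.create_task from going unnoticed is to attach a done-callback that logs any exception. This is a sketch of that technique, not necessarily the fix applied in this project; spawn, the logger name, and the demo coroutine are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("brain_reps.tasks")


def _log_task_failure(task: asyncio.Task) -> None:
    """Done-callback: surface any exception raised inside the task."""
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        logger.error("background task %s failed", task.get_name(), exc_info=exc)


def spawn(coro, name: str) -> asyncio.Task:
    """Like asyncio.create_task, but failures are always logged."""
    task = asyncio.create_task(coro, name=name)
    task.add_done_callback(_log_task_failure)
    return task


async def _demo() -> list[str]:
    """Run a failing task and collect what the logger emitted."""
    seen: list[str] = []

    class Capture(logging.Handler):
        def emit(self, record: logging.LogRecord) -> None:
            seen.append(record.getMessage())

    logger.addHandler(Capture())

    async def broken() -> None:
        raise RuntimeError("boom")

    task = spawn(broken(), name="evaluate-answer")
    await asyncio.gather(task, return_exceptions=True)
    await asyncio.sleep(0)  # give any remaining callbacks a turn
    return seen


captured = asyncio.run(_demo())
print(captured)
```

With a wrapper like this, the failure lands in whichever log the application logger writes to, rather than depending on where the runtime sends unhandled task errors.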
The Architectural Reckoning
By April 23, the feedback loop worked end-to-end. The system was functionally correct. It was also structurally fragile.
A monolith, pressed to scale, is like a house wired with one circuit breaker for every room. When the breaker trips — and it will — the whole house goes dark. You can add more appliances, but you cannot add more circuit breakers without rewiring. The Python system had reached that point. Five hard limits had emerged that no amount of refactoring would resolve. The scheduler was macOS-only, making deployment to the intended Raspberry Pi 5 impossible without a complete replacement. There was no admin surface — configuration meant editing TOML files, monitoring meant grepping logs. The AI inference plane and the user-facing plane shared a single process boundary, so a crash in Ollama inference took down Slack event handling. Multi-user isolation was nominal, with FSRS cards keyed by a global default_user that would require surgery to extend. And geographic availability was zero: if the Mac was sleeping, quizzes did not arrive.
The decision was made to decompose the system — not by rewriting everything, but by identifying the correct boundary: Cloudflare owns all user-facing logic; the local device owns all AI and model inference.
The target architecture:
Slack DM / Teams DM / Email
        │
        ▼
Cloudflare Workers (4 workers)
    brain-reps-worker: delivery, answer eval, gate, slash commands
    brain-reps-kg-worker: Neo4j → D1 mirror sync
    brain-reps-ingester: PDF ingestion pipeline
    brain-reps-admin: REST API + audit log
        │
Cloudflared Tunnel (shared-secret auth)
        │
        ▼
Local Router (FastAPI shim) → Qdrant · Neo4j · Ollama · Langfuse
A single constraint shaped every subsequent decision: Cloudflare free tier only. No Durable Objects, no paid features, no Hyperdrive. 100k requests/day, 1k KV writes/day, 5 GB D1.
The Cloudflare Migration: Six Days, Four Workers
What followed was a systematic, branch-by-branch construction. Each branch had a single clear goal and could not merge until the user manually validated it and wrote "approved".
Infra Baseline (April 24)
A pnpm monorepo with four worker stubs, shared TypeScript types, Biome for linting (replacing ESLint + Prettier with a single binary), and strict TypeScript flags from day one: exactOptionalPropertyTypes, noUncheckedIndexedAccess, noUnusedLocals. CI wired to lint, typecheck, and wrangler dry-run on every push.
The lesson from enabling strict flags immediately rather than retrofitting: the TypeScript fights that came later were manageable precisely because the contract was established upfront.
D1 Schema (April 25)
A 10-table schema designed not just for the current feature set but for the analytics the admin dashboard would need. The key design was quiz_configs: a 4-scope precedence hierarchy — learner > group > tier > global — that allows one global default row to be overridden at any level without denormalisation. A single ORDER BY CASE scope WHEN 'learner' THEN 1 … query resolves the correct config for any learner in any context.
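The precedence lookup can be tried out locally, since D1 is built on SQLite. The sketch below uses Python's sqlite3 module with a deliberately tiny table: the columns scope_id and questions_per_day, the sample rows, and the full CASE expression are assumptions written out from the stated learner > group > tier > global order, not the project's actual schema or query.

```python
import sqlite3
from typing import Optional

# Assumed query shape: collect every row that could apply to this learner,
# then keep the most specific one.
RESOLVE_SQL = (
    "SELECT scope, questions_per_day FROM quiz_configs "
    "WHERE (scope = 'learner' AND scope_id = :learner_id) "
    "   OR (scope = 'group' AND scope_id = :group_id) "
    "   OR (scope = 'tier' AND scope_id = :tier_id) "
    "   OR scope = 'global' "
    "ORDER BY CASE scope "
    "  WHEN 'learner' THEN 1 WHEN 'group' THEN 2 "
    "  WHEN 'tier' THEN 3 ELSE 4 END "
    "LIMIT 1"
)


def make_db() -> sqlite3.Connection:
    """Build an in-memory table with one row per scope (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quiz_configs ("
        "scope TEXT NOT NULL, scope_id TEXT, questions_per_day INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO quiz_configs VALUES (?, ?, ?)",
        [
            ("global", None, 3),
            ("tier", "free", 2),
            ("group", "lab-a", 4),
            ("learner", "u1", 6),
        ],
    )
    return conn


def resolve_config(conn: sqlite3.Connection, learner_id: str,
                   group_id: Optional[str] = None,
                   tier_id: Optional[str] = None):
    """Return (scope, questions_per_day) for the most specific matching row."""
    params = {"learner_id": learner_id, "group_id": group_id, "tier_id": tier_id}
    return conn.execute(RESOLVE_SQL, params).fetchone()


db = make_db()
print(resolve_config(db, "u1", "lab-a", "free"))
print(resolve_config(db, "u2", "lab-a", "free"))
print(resolve_config(db, "u3"))
```

The global row acts as the fallback, so a learner with no overrides still resolves to a config without any special-case code.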
Local Router and Compose Tunnel (April 25)
The local AI stack (Qdrant, Neo4j, Ollama, Langfuse) was wrapped in a FastAPI router with a shared-secret authentication guard. Crucially, the cloudflared tunnel was moved from a launchd daemon into a Docker Compose service. docker compose up -d now works identically on the MacBook and on the Pi 5. Pi 5 resource overrides — reduced memory caps, GDS plugin removed from Neo4j — live in a separate compose overlay file.
Worker Core (April 26)
Slack event handling with HMAC-SHA256 constant-time signature verification and ±5-minute replay prevention. Quiz delivery: load learner → KV cache-aside → resolve config → due cards → embed → search → generate → post DM → insert pending answers. Six Langfuse tracing gaps closed — real HTTP timing, required learner attribution, token usage from Ollama prompt_eval_count/eval_count, service.name in metadata, model parameters, and a separate Qdrant retrieval span.
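The signature check follows Slack's documented scheme: an HMAC-SHA256 over "v0:{timestamp}:{body}", sent as "v0=" plus the hex digest. The worker itself is TypeScript; the sketch below expresses the same logic in Python for readability, and the function name and the now parameter (injected for testing) are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 5 * 60  # reject requests older or newer than 5 minutes


def verify_slack_signature(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None) -> bool:
    """Return True only for a fresh request with a valid v0 signature."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False

    # Replay prevention: the timestamp must fall inside the allowed window.
    current = time.time() if now is None else now
    if abs(current - sent_at) > MAX_SKEW_SECONDS:
        return False

    # Recompute the signature over the exact raw request body.
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest

    # compare_digest runs in constant time, so it leaks no prefix matches.
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = "test-secret"  # placeholder, not a real signing secret
payload = b'{"type":"event_callback"}'
stamp = "1713724800"
good = "v0=" + hmac.new(secret.encode(), b"v0:" + stamp.encode() + b":" + payload,
                        hashlib.sha256).hexdigest()
print(verify_slack_signature(secret, stamp, payload, good, now=1713724900))
```

The check must run against the raw bytes of the request. Re-serialising parsed JSON can change whitespace or key order and invalidate an otherwise correct signature.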
The KV write budget constraint produced a clean design: KV is a hint cache only. D1 is the authoritative gate state. A 50-learner cohort uses ≤150 KV writes/day for cache warming against a 1,000/day budget.
Learner Gating and FSRS (April 26)
The one-at-a-time gate is implemented as a single atomic D1 upsert: INSERT … ON CONFLICT DO UPDATE WHERE status = 'open'. No SELECT-then-INSERT race. No distributed lock. A concurrent delivery call simply reads false and skips delivery. The gate state lives in D1; KV holds a 5-minute closed hint purely as a fast-path guard for back-to-back cron windows.
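The upsert-as-gate idea can be reproduced against plain SQLite (version 3.24 or later, which introduced upsert). The sketch below is an assumption-laden illustration: the learner_gate table, its two-state status column, and reading the change count via rowcount are mine. A Worker would presumably read the equivalent change count from D1's result metadata instead.

```python
import sqlite3

# Assumed semantics: 'open' means the gate is free, 'closed' means a quiz
# is in flight. The insert creates a closed gate for a new learner; on
# conflict, the update only succeeds if the existing gate is open.
ACQUIRE_SQL = (
    "INSERT INTO learner_gate (learner_id, status) VALUES (?, 'closed') "
    "ON CONFLICT (learner_id) DO UPDATE SET status = 'closed' "
    "WHERE learner_gate.status = 'open'"
)


def make_db() -> sqlite3.Connection:
    """Create the hypothetical gate table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE learner_gate ("
        "learner_id TEXT PRIMARY KEY, "
        "status TEXT NOT NULL CHECK (status IN ('open', 'closed')))"
    )
    return conn


def acquire_gate(conn: sqlite3.Connection, learner_id: str) -> bool:
    """Try to take the gate in one statement; True means this caller won."""
    cursor = conn.execute(ACQUIRE_SQL, (learner_id,))
    conn.commit()
    # One changed row: inserted or flipped open -> closed. Zero: already closed.
    return cursor.rowcount == 1


def release_gate(conn: sqlite3.Connection, learner_id: str) -> None:
    """Reopen the gate once the learner has answered."""
    conn.execute(
        "UPDATE learner_gate SET status = 'open' WHERE learner_id = ?",
        (learner_id,),
    )
    conn.commit()


db = make_db()
first = acquire_gate(db, "u1")
second = acquire_gate(db, "u1")
release_gate(db, "u1")
third = acquire_gate(db, "u1")
print(first, second, third)
```

Because the check and the write happen in the same statement, there is no window between reading the gate and closing it in which a second caller could slip through.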
FSRS moved from Python's py-fsrs to TypeScript's ts-fsrs, eliminating a tunnel round-trip for every card review. review_log captures every rating event — the raw material for future retention analytics.
Slash Commands (April 26)
Five commands — /brainhelp, /brainstatus, /brainnow, /brainpause, /brainresume — all wired through Slack's 3-second response deadline constraint. The pattern: acknowledge immediately with a lightweight response, defer all heavy work via ctx.waitUntil. /brainnow returns "⏳ Checking..." in under 500 ms, then posts the real quiz to response_url after delivery completes.
Idempotency: every slash command writes a KV key with a 10-minute TTL. Slack retries on 5xx — the second invocation reads the key and returns "Already processed" without re-executing.
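The idempotency-key pattern can be sketched with an in-memory stand-in for the KV namespace. TtlStore, put_if_absent, and handle_slash_command are hypothetical names. Unlike this sketch, Workers KV is eventually consistent and offers no atomic put-if-absent, so the real guard is best-effort rather than a strict lock.

```python
import time
from typing import Callable


class TtlStore:
    """In-memory stand-in for a key-value namespace with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def put_if_absent(self, key: str, ttl_seconds: float) -> bool:
        """Record the key and return True, unless a live entry already exists."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True


def handle_slash_command(store: TtlStore, command_id: str,
                         run: Callable[[], str]) -> str:
    """Execute a command once per 10-minute window; retries are short-circuited."""
    if not store.put_if_absent(f"slash:{command_id}", ttl_seconds=600):
        return "Already processed"
    return run()


# Demo with a controllable clock so expiry can be shown without sleeping.
fake_now = [0.0]
store = TtlStore(clock=lambda: fake_now[0])
calls: list[str] = []


def deliver() -> str:
    calls.append("ran")
    return "Quiz sent"


print(handle_slash_command(store, "cmd-1", deliver))
print(handle_slash_command(store, "cmd-1", deliver))  # simulated Slack retry
```

The 10-minute TTL keeps the key alive comfortably longer than any retry sequence while letting the store clean itself up.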
KG Worker, Ingester, Admin, UI, Metrics (April 26–27)
The KG worker runs a 6-hour cron that mirrors Neo4j concepts and intersections into D1, making them available to Workers without a tunnel call. The ingester handles PDF jobs via Cloudflare Queues, decoupling ingestion from the delivery path. The admin worker provides a full REST API with Cloudflare Access JWT authentication and an audit log on every write. The admin UI — seven pages, zero build step, pure HTML with htmx and Alpine.js — delivers learner management, corpus overview, feedback triage, and FSRS analytics.
Email and Teams (April 27–30)
The channel-agnostic adapter pattern established in the Python era paid off here. The two channel handlers, handleEmailAnswer and handleTeamsAnswer, follow the same interface: HMAC signature verification, learner lookup by channel identifier, and answer evaluation via the same FSRS path. Neither required a D1 migration; the email and teams_user_id columns were already in 0001_initial.sql.
What Was Actually Built
Ten days from first commit. One person. Zero cloud budget.
The ingestion pipeline takes a PDF through Docling, embeds it with snowflake-arctic-embed2 (1024-dim) into Qdrant, and extracts concepts and cross-domain intersections into Neo4j — classified against a 30-subdomain biomedical vocabulary into a domain hierarchy with pairwise and N-way (3–5) intersection detection, strength-scored edges, and a live D1 mirror for query efficiency.
Delivery runs on a Cloudflare Worker cron every 10 minutes. FSRS scheduling selects due cards, a one-at-a-time gate prevents concurrency collisions, and quiz generation draws on RAG over Qdrant and Ollama — with the result posted to the learner across whichever channel they prefer: Slack DM, email via Resend, or Microsoft Teams.
When an answer arrives, an LLM judge evaluates it with structured JSON output, assigns an FSRS rating, schedules the next review interval, and appends to a review_log that accumulates the raw material for retention analytics. Every LLM call is traced in Langfuse with real timing, token usage, learner attribution, and a separate Qdrant retrieval span.
The admin surface covers the full operational lifecycle: a REST API with Cloudflare Access JWT authentication, an audit log on every write, seven dashboard pages, feedback triage, and per-learner quiz configuration with 4-scope precedence. Learners handle routine operations themselves through /brainstatus, /brainpause, and /brainresume. The entire AI stack deploys with one command, docker compose up -d, on either the MacBook or the Pi 5.
What This Project Demonstrates
Constraint as design input, not obstacle. There is a temptation, when resources are abundant, to solve problems by adding infrastructure. The Cloudflare free tier — 1k KV writes/day, 100k requests/day, 5 GB D1 — removed that option entirely. What emerged was not a compromise but a cleaner architecture: KV as hint cache only, D1 as the authoritative truth, atomic gate acquisition with no SELECT-then-INSERT race, deferred heavy work via ctx.waitUntil. Every design decision was forced into clarity by the constraint that preceded it. A tight budget is an unusually effective design tool.
The discipline of working backward from a known point. The five-bug feedback silence required a diagnostic practice that most engineering documentation never names explicitly: when you do not know where the system breaks, insert a known-good log line, observe what fires, and work backward from where it stops. This is not a debugging trick — it is a method for reasoning under uncertainty. The architecture that emerged from those five bugs is more legible precisely because every failure had to be fully understood before the next step was taken.
The plane separation principle. Perhaps the hardest architectural decision in the entire project was recognising that the AI inference plane and the user-facing plane should not share a process boundary, a failure domain, or a deployment lifecycle. Think of it as the difference between the kitchen and the dining room of a restaurant: patrons must never feel a fire in the kitchen, and a full dining room must never prevent the kitchen from cooking. The tunnel pattern achieves this separation cleanly — and does so without requiring the AI stack to move to the cloud, where it would cost money and lose access to the local GPU.
Schema as product thinking. The D1 schema was designed before a single line of worker code was written. The quiz_configs 4-scope hierarchy, the review_log analytics foundation, the learner_channel_prefs multi-channel support — these were not afterthoughts bolted on when a feature demanded them. They were intentional product decisions expressed in SQL, which meant every subsequent feature had a home before it was built. The schema was the product roadmap.
Iterative design grounded in actual failure. Each iteration in this project was driven not by hypothetical futures but by a specific, observable breakdown in the present version. The domain classifier returned nonsense, so the classification architecture changed. The feedback loop failed silently, so the async patterns changed. The monolith hit five hard limits simultaneously, so the entire deployment model changed. The system that exists at the end of ten days is the residue of those failures. To put this differently: the design was not planned; it was earned.
What Comes Next
The system is ready to deploy. The Cloudflare Workers infrastructure is tested, the local compose stack runs on Mac and Pi 5, and the admin dashboard covers the full operational surface. What remains is moving from local branches to a live deployment — running wrangler deploy for each worker, applying D1 migrations to production, and setting Cloudflare Access on the admin UI.
Beyond deployment: the review_log has been accumulating rating events since the first session. The depth/range metrics infrastructure — daily snapshots of concept coverage, mastery vectors, domain grasp scores — is wired. The first real analytics question — which concepts is a given learner weakest on, and which cross-domain intersections are underexplored? — has all the data it needs to be answered.
The tutor is ready. The learner is next.