On Designing Around a Hard Resource Limit

Three hard limits (memory, time, and write quota) and the one pattern they share: constraints are design inputs, not obstacles. Treating each ceiling as a forcing function produced a cleaner architecture than the unconstrained version would have.

Every system has ceilings. The interesting question is not how to raise them, but how to design so they don't matter.

Building an AI-powered spaced repetition system this year, I ran into three hard limits in quick succession. Each one taught the same underlying lesson from a different angle.

The first was memory. Running a large language model locally alongside a vector database, a graph database, and an observability stack, I kept hitting mysterious timeout errors during document ingestion. The model appeared to be running. The code was fine. The problem was that the total weight of everything loaded into memory exceeded what the machine had available, so the OS was quietly paging model weights in and out mid-inference, killing long-running connections in the process. The fix wasn't to get more RAM. It was to switch to a smaller, quantized model that left genuine headroom for the rest of the stack. A model that fits comfortably and runs reliably beats a larger model that thrashes every time. The total memory footprint, rather than raw capability, is the right optimization to target when you have a fixed memory budget.

Figure: A list of open-source large language and embedding models running on a local device.
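
One way to make the total footprint explicit is to add it up before loading anything. The sketch below is illustrative only: the component names follow the stack described above, but every size, the 16 GiB machine, and the 2 GiB operating-system reserve are made-up assumptions, not measurements from this project.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Component:
    name: str
    resident_bytes: int  # estimated steady-state resident memory


def headroom_bytes(total_ram: int, components: list[Component],
                   reserve_bytes: int = 2 * GIB) -> int:
    """Bytes left after every component and an OS reserve; negative means over budget."""
    used = sum(c.resident_bytes for c in components)
    return total_ram - reserve_bytes - used


def fits(total_ram: int, components: list[Component],
         reserve_bytes: int = 2 * GIB) -> bool:
    """True when the whole stack, not just the model, fits with the reserve intact."""
    return headroom_bytes(total_ram, components, reserve_bytes) >= 0


# Hypothetical numbers for illustration only.
TOTAL_RAM = 16 * GIB
REST_OF_STACK = [
    Component("vector database", 2 * GIB),
    Component("graph database", 2 * GIB),
    Component("observability stack", 1 * GIB),
]
LARGE_MODEL = Component("large model", 13 * GIB)
SMALL_QUANTIZED_MODEL = Component("small quantized model", 5 * GIB)

if __name__ == "__main__":
    for model in (LARGE_MODEL, SMALL_QUANTIZED_MODEL):
        stack = [model, *REST_OF_STACK]
        left = headroom_bytes(TOTAL_RAM, stack)
        print(f"{model.name}: fits={fits(TOTAL_RAM, stack)}, headroom={left / GIB:.1f} GiB")
```

The point of the check is that the model is evaluated as one line item in a budget rather than on its own merits.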

The second was time. Slack gives you three seconds to acknowledge an incoming event, or it retries. LLM evaluation and vector search take five to thirty seconds. These two facts cannot be reconciled by making the work faster. The only solution is to stop treating acknowledgement and completion as the same thing. Respond immediately to say you received the request, then do the real work asynchronously and deliver the result via a separate push. Once you internalize this pattern, you start seeing it everywhere: the requester's deadline and your processing time are independent variables, and conflating them is a design mistake.

Figure: A mobile screenshot of a Slack chat interface displaying a multiple-choice question. Getting an answer to the question requires an API request that takes more than 45 seconds to return a response from the local model.
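
The split between acknowledgement and completion can be sketched with plain asyncio. This is a minimal illustration, not the handler used in this project: `evaluate_answer`, `deliver`, and the in-memory `outbox` are hypothetical stand-ins for the slow LLM and vector search work and for the separate push back to the requester, and no real Slack API calls are made.

```python
import asyncio

ACK_DEADLINE_SECONDS = 3.0  # the acknowledgement window described above


async def evaluate_answer(event: dict) -> str:
    """Hypothetical stand-in for slow LLM evaluation and vector search."""
    await asyncio.sleep(event.get("work_seconds", 0.05))
    return f"result for {event['id']}"


async def deliver(result: str, outbox: list[str]) -> None:
    """Hypothetical stand-in for the separate push that carries the result."""
    outbox.append(result)


async def process(event: dict, outbox: list[str]) -> None:
    """The real work, free to take far longer than the ack deadline."""
    result = await evaluate_answer(event)
    await deliver(result, outbox)


def handle_event(event: dict, outbox: list[str],
                 background: set[asyncio.Task]) -> dict:
    """Schedule the slow work and return the acknowledgement immediately."""
    task = asyncio.create_task(process(event, outbox))
    background.add(task)  # hold a reference so the task is not garbage collected
    task.add_done_callback(background.discard)
    return {"ok": True}


async def demo() -> tuple[dict, list[str], list[str]]:
    outbox: list[str] = []
    background: set[asyncio.Task] = set()
    ack = handle_event({"id": "q1"}, outbox, background)
    delivered_at_ack = list(outbox)    # snapshot: the ack precedes any result
    await asyncio.gather(*background)  # let the background work finish
    return ack, delivered_at_ack, outbox


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In-process tasks like these are lost if the process restarts, so a durable job queue may be a better fit when results must not be dropped.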

The third was the write quota. The caching layer I was using had a daily write limit. Naively caching state on every operation across a cohort of learners would have exhausted it before the day was out. The temptation here is to either pay for a higher tier or quietly accept that things will break at scale. Instead, I made the constraint explicit in the architecture. I decided which layer needed to be the authoritative source of truth and relegated the cache to a fast-path hint that was explicitly allowed to be stale. Cache writes became rare and intentional rather than automatic. The limit stopped being a bottleneck because the design no longer pushed against it.
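
A rough sketch of that arrangement follows. All names here are hypothetical, the authoritative store and the cache are in-memory stand-ins, and the daily reset of the quota is not modelled. It shows one possible shape: every operation writes to the source of truth, the cache is written once per session, and a refused cache write is not an error.

```python
from dataclasses import dataclass, field


@dataclass
class HintCache:
    """In-memory stand-in for a quota-limited cache (hypothetical)."""
    daily_write_limit: int
    writes_today: int = 0
    data: dict = field(default_factory=dict)

    def get(self, key: str):
        return self.data.get(key)

    def try_set(self, key: str, value: dict) -> bool:
        if self.writes_today >= self.daily_write_limit:
            return False  # out of budget: skip the write, the cache is only a hint
        self.data[key] = value
        self.writes_today += 1
        return True


@dataclass
class LearnerState:
    cache: HintCache
    store: dict = field(default_factory=dict)  # authoritative source of truth

    def record_review(self, learner_id: str, card_id: str) -> None:
        """Every operation updates the authoritative store; no cache write."""
        state = self.store.setdefault(learner_id, {"reviews": 0, "last_card": None})
        state["reviews"] += 1
        state["last_card"] = card_id

    def end_session(self, learner_id: str) -> bool:
        """The rare, intentional cache write: one snapshot per session."""
        return self.cache.try_set(learner_id, dict(self.store[learner_id]))

    def quick_summary(self, learner_id: str):
        """Fast path: the hint may be stale; fall back to the store on a miss."""
        hint = self.cache.get(learner_id)
        return hint if hint is not None else self.store.get(learner_id)


if __name__ == "__main__":
    state = LearnerState(cache=HintCache(daily_write_limit=1))
    for card in ("c1", "c2", "c3"):
        state.record_review("ana", card)
    state.end_session("ana")
    state.record_review("ana", "c4")
    print(state.quick_summary("ana"), state.store["ana"])
```

Callers that need the exact current state read the store directly; `quick_summary` is only for paths where a slightly old answer is acceptable.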

The pattern across all three is the same: a hard limit forces you to be precise about something you were previously hand-waving. Memory pressure forces you to think carefully about the total footprint, not just the component you're adding. An API response deadline forces you to separate acknowledgement from completion. A write quota forces you to define exactly what needs to be consistent and what is allowed to lag. In each case, the constraint produced a cleaner architecture than the unconstrained version would have.

Constraints are design inputs. The earlier you see them that way, the better the system you end up with.