On April 5th, Mnemosyne was a single Python file with a SQLite database. By April 29th, it was 292 tests, eight architectural phases, and a complete rewrite of almost everything. This is the story of what happened in between, and why it took longer than I expected.
Phase 1: The foundation
The first phase was just making the basic system solid. SQLite schema design. CRUD operations. A simple CLI. I thought this would take a day. It took three. Database design is one of those things where every decision echoes forward forever. Get the schema wrong and you're migrating data for the rest of the project's life. I got it mostly right, which meant I only had to do one major migration later.
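For a sense of scale, the whole phase-1 core fits in a handful of SQL statements and a thin Python wrapper. Here's a minimal sketch of that shape, assuming a single memories table; the table and column names are made up for illustration, not Mnemosyne's actual schema:

```python
import sqlite3

# Minimal phase-1 shape: one memories table plus an index on creation time.
# Table and column names are hypothetical, not Mnemosyne's actual schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_memories_created ON memories(created_at);
"""

def init_db(path: str = "mnemosyne.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_memory(conn: sqlite3.Connection, content: str) -> int:
    cur = conn.execute("INSERT INTO memories (content) VALUES (?)", (content,))
    conn.commit()
    return cur.lastrowid
```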
Phase 2: Structured extraction
Phase 2 was where things got interesting. Instead of just storing raw conversation text, I built an LLM-driven extraction system that pulls facts, preferences, and entities out of conversations automatically. This was the first time Mnemosyne started feeling intelligent rather than just archival. It was also the first time I had to deal with LLM hallucinations in a memory context. The extractor sometimes invented facts that were never said. Fixing that required confidence scoring and human-in-the-loop verification.
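The fix is easier to see in code than in prose. A minimal sketch of the triage idea, assuming each extracted fact carries a self-reported confidence and anything below a threshold goes to a human review queue instead of straight into memory; the field names and the 0.8 cutoff are illustrative, not Mnemosyne's real values:

```python
from dataclasses import dataclass

# Illustrative shape of an extraction result with confidence scoring.
# Field names and the 0.8 threshold are assumptions, not Mnemosyne's actual values.
@dataclass
class ExtractedFact:
    text: str          # the fact as stated by the extractor
    confidence: float  # extractor's self-reported confidence, 0.0 to 1.0
    source_turn: int   # which conversation turn it was pulled from

REVIEW_THRESHOLD = 0.8

def triage(facts: list[ExtractedFact]) -> tuple[list[ExtractedFact], list[ExtractedFact]]:
    """Split extracted facts into auto-accept and human-review queues."""
    accepted = [f for f in facts if f.confidence >= REVIEW_THRESHOLD]
    needs_review = [f for f in facts if f.confidence < REVIEW_THRESHOLD]
    return accepted, needs_review
```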
Phases 3-4: Scoring and banks
Phase 3 added configurable hybrid scoring, combining vector similarity with full-text relevance. Phase 4 added memory banks: isolated contexts for different projects or users. Both phases seemed straightforward on paper. Both involved rewriting the query engine from scratch. The hybrid scoring in particular required careful tuning: too much vector weight and you miss exact matches; too much text weight and you miss conceptual similarity. I spent two days just adjusting parameters against the benchmark suite.
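The scoring itself is just a weighted blend; the hard part is picking the weight. A sketch of the combination, assuming both scores are already normalized to [0, 1] and using an illustrative default weight rather than whatever value came out of the benchmark tuning:

```python
def hybrid_score(vector_sim: float, text_score: float,
                 vector_weight: float = 0.6) -> float:
    """Blend vector similarity with full-text relevance.

    Both inputs are assumed to be normalized to [0, 1] beforehand;
    the 0.6 default weight is illustrative, not Mnemosyne's tuned value.
    """
    return vector_weight * vector_sim + (1.0 - vector_weight) * text_score

# Example: a memory that matches the query text almost exactly but sits far
# away in embedding space still scores well (~0.59) with the default split,
# which is exactly the exact-match-vs-conceptual-similarity tradeoff.
print(hybrid_score(vector_sim=0.35, text_score=0.95))
```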
Phase 5: Isolation
Memory bank isolation was technically simple but conceptually important. Each bank gets its own namespace in the database. Memories don't leak between banks. This matters when you're running multiple projects or sharing a machine. The implementation was a few lines of SQL. The testing was extensive: I had to prove that bank A could never see bank B's memories, even with creative query attempts.
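In code, the isolation really is that small: a bank column and a WHERE clause on every read. This sketch uses hypothetical table and column names, and the assertion at the end mirrors the kind of property the tests had to prove, that bank B can never see bank A's memories:

```python
import sqlite3

# Illustrative bank-scoped schema and query: the bank column plus a WHERE
# clause on every read is the whole isolation mechanism. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memories (
    id      INTEGER PRIMARY KEY,
    bank    TEXT NOT NULL,
    content TEXT NOT NULL
);
CREATE INDEX idx_memories_bank ON memories(bank);
""")

def search(conn: sqlite3.Connection, bank: str, term: str) -> list[str]:
    # Every query is scoped to a single bank; there is no cross-bank path.
    rows = conn.execute(
        "SELECT content FROM memories WHERE bank = ? AND content LIKE ?",
        (bank, f"%{term}%"),
    )
    return [r[0] for r in rows]

conn.execute("INSERT INTO memories (bank, content) VALUES ('project-a', 'prefers dark mode')")
assert search(conn, "project-b", "dark") == []  # bank B never sees bank A's memories
```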
Phases 6-7: MCP and tools
Phase 6 added an MCP server, letting Mnemosyne expose its tools through a standard protocol. Phase 7 expanded the tool suite: update, forget, search with filters, temporal queries. This was where the API surface started getting large enough that I needed real documentation. I also hit the first performance bottleneck: embedding generation is slow, and doing it synchronously blocks the agent. I added caching and batching, which helped but didn't fully solve the problem.
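Here's the caching-and-batching workaround in sketch form, with a bounded LRU cache so it doesn't quietly become the unbounded-growth bug described below. The `embed_batch` callable stands in for whatever actually generates embeddings, and the 10,000-entry cap is arbitrary; none of this is Mnemosyne's actual API:

```python
from collections import OrderedDict

# Illustrative bounded embedding cache with batched misses. `embed_batch`
# stands in for the real embedding call; the cap is arbitrary.
class EmbeddingCache:
    def __init__(self, embed_batch, max_entries: int = 10_000):
        self.embed_batch = embed_batch
        self.max_entries = max_entries
        self._cache = OrderedDict()  # text -> embedding vector

    def get_many(self, texts: list[str]) -> list[list[float]]:
        misses = [t for t in texts if t not in self._cache]
        if misses:
            # One batched call instead of one model call per text.
            for text, vec in zip(misses, self.embed_batch(misses)):
                self._cache[text] = vec
        results = []
        for t in texts:
            self._cache.move_to_end(t)       # mark as recently used
            results.append(self._cache[t])
        while len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return results
```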
Phase 8: Streaming and patterns
The final phase before 2.0 was the most ambitious. Streaming memory updates (so the agent doesn't wait for the database). Pattern recognition (detecting repeated topics and preferences). Plugin integration (the saga documented in another post). This phase broke more things than all the previous phases combined. The streaming code introduced race conditions. The pattern recognition needed its own data structures. The plugin system required rewriting the initialization flow entirely.
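I'd sketch the streaming piece most cautiously, since the real code went through several race-condition fixes. One way to get the "agent doesn't wait for the database" property is a single writer thread that owns the connection and drains a queue; this is an illustration of that pattern under those assumptions, not Mnemosyne's implementation:

```python
import queue
import sqlite3
import threading

# One way to keep the caller from waiting on the database: memories are
# enqueued, and a single background thread owns the SQLite connection, so
# writes never block the agent and never race each other. Names are hypothetical.
class StreamingWriter:
    def __init__(self, db_path: str):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(db_path,), daemon=True)
        self._thread.start()

    def remember(self, content: str) -> None:
        self._queue.put(content)   # returns immediately; the write happens later

    def close(self) -> None:
        self._queue.put(None)      # sentinel: drain the queue, then stop
        self._thread.join()

    def _run(self, db_path: str) -> None:
        conn = sqlite3.connect(db_path)  # the connection never leaves this thread
        conn.execute(
            "CREATE TABLE IF NOT EXISTS memories (id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
        )
        while True:
            content = self._queue.get()
            if content is None:
                break
            conn.execute("INSERT INTO memories (content) VALUES (?)", (content,))
            conn.commit()
        conn.close()
```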
What broke
The CLI broke three times. The database migration script failed on Windows paths. The test suite had four flaky tests that only failed in CI, never locally. The embedding cache grew without bound until I added eviction. The plugin registration failed silently in some Hermes configurations. The `sleep()` consolidation function sometimes double-processed memories. The list goes on.
Each fix taught me something. The flaky tests taught me about mocking LLM calls properly. The path issues taught me about cross-platform Python. The cache eviction taught me that "unbounded growth" is a bug you don't find until production. Every broken thing was a lesson I needed to learn.
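The mocking lesson is easiest to show as a pattern: if the extractor takes its LLM client as a dependency, tests can hand it a canned response and never touch the network, which is what makes them deterministic. A sketch with unittest.mock; the class and method names are illustrative, not Mnemosyne's actual API:

```python
from unittest.mock import Mock

# Illustrative pattern: the extractor takes its LLM client as a dependency,
# so tests inject a Mock with a fixed response and stay deterministic.
class Extractor:
    def __init__(self, llm_client):
        self.llm_client = llm_client

    def extract(self, conversation: str) -> list[str]:
        response = self.llm_client.complete(conversation)
        return [line.strip() for line in response.splitlines() if line.strip()]

def test_extract_is_deterministic():
    llm = Mock()
    llm.complete.return_value = "prefers dark mode\nworks mostly in Python"
    facts = Extractor(llm).extract("...conversation text...")
    assert facts == ["prefers dark mode", "works mostly in Python"]
    llm.complete.assert_called_once()
```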
Why it took 24 days
I originally thought 2.0 would take a week. Maybe two. It took 24 days because software doesn't care about your estimates. Every phase revealed problems I didn't know existed. Every fix uncovered edge cases. Every optimization required measurement, tuning, and verification. The 292 tests in the final suite aren't there to look impressive. They're there because I broke things enough times that I stopped trusting myself without proof.
The result is something I'm genuinely proud of. Not because it's perfect, but because it's honest. It does what it says. It doesn't hide your data in someone else's cloud. It remembers things without being creepy about it. And it started as a Saturday afternoon frustration and grew into something that might actually be useful to other people. That's worth 24 days of my life. I think it's worth a few minutes of yours to try it out.