
Blog

Three resolutions for your AI's memory. Pick one for your context budget.

5 mins

Until v0.13, every Ogham wiki preamble was the same shape: the full compiled body. Useful on the first call. Expensive to inject on every turn. About 1,500 tokens per page, and the top 3 pages are injected by default, so the preamble alone runs past 4,000 tokens before you’ve seen a single hit.

v0.13 splits the cached summary into three forms. One sentence, one paragraph, full body. You pick which one rides along on each retrieval.
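The trade-off in numbers – a hypothetical sketch where `Preamble`, `preamble_budget`, and the per-page costs of the two smaller forms are illustrative stand-ins, not Ogham’s documented API. Only the ~1,500 tokens per full page and the top-3 default come from the post itself:

```python
# Hypothetical sketch: choosing a wiki-preamble resolution per retrieval.
# `Preamble` and `preamble_budget` are illustrative names, not Ogham's API.
from enum import Enum

class Preamble(Enum):
    SENTENCE = "sentence"    # one line: cheapest, good for routine turns
    PARAGRAPH = "paragraph"  # short summary: the middle ground
    FULL = "full"            # ~1,500 tokens/page: first call only

def preamble_budget(pages: int, level: Preamble) -> int:
    """Rough token cost of the injected preamble. The sentence and
    paragraph figures are guesses; only FULL comes from the post."""
    per_page = {Preamble.SENTENCE: 30, Preamble.PARAGRAPH: 150, Preamble.FULL: 1500}
    return pages * per_page[level]

# Top-3 pages at full resolution blows past 4,000 tokens;
# one-sentence summaries keep the same coverage under 100.
print(preamble_budget(3, Preamble.FULL))      # 4500
print(preamble_budget(3, Preamble.SENTENCE))  # 90
```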

Compile your memory. Take it with you.

5 mins

For the last couple of months, Ogham has been about pushing memory in – store this, link this, retrieve this. v0.12 flips the question. You can now ask “what does my memory know about X” and get back one coherent page, synthesized from every memory carrying that tag. Then you can dump the whole layer to a folder of plain markdown and open it in Obsidian.

A wiki layer over your memory, and an exporter that hands the wiki layer to you as files you own.
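A minimal sketch of what that export amounts to – one markdown file per synthesized page. The function and layout here are my assumptions, not Ogham’s actual exporter:

```python
from pathlib import Path

def export_wiki(pages: dict[str, str], out_dir: str = "ogham-wiki") -> None:
    """Write each compiled wiki page as a plain .md file - an
    Obsidian-ready folder of files you own (illustrative, not Ogham's CLI)."""
    root = Path(out_dir)
    root.mkdir(exist_ok=True)
    for tag, body in pages.items():
        (root / f"{tag}.md").write_text(f"# {tag}\n\n{body}\n")
```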

Three weeks, three releases, same numbers

4 mins

Three weeks ago I posted R@10 = 0.737 on BEAM and 91.8% QA on LongMemEval. Since then, v0.10 shipped intent-gated reformulation; v0.11 shipped migration 026 (memory lifecycle), migration 028 (topic summary cache), Phase 4 prompt-injection guards, and a 2.4x batch-INSERT perf fix. The default embedding dim moved from 512 to 768.

Heavy change. Time to re-measure.

The scoreboard

Same benchmarks – the same 400 BEAM questions and 500 LongMemEval questions – run this week on v0.11 with the code that’s going to PyPI today.

Your AI logs show who used it. They don't show what it remembered.

7 mins

We spent a weekend running Ogham through someone else’s benchmark. Here’s what happened.

Last week we published that Ogham hits 99.5% Recall@10 on LongMemEval – the right memory chunk lands in the top 10 results for nearly every question. Good number. We were pleased with ourselves.

Then we ran the same 500 questions through the AMB benchmark harness, built by the Vectorize team (the people behind Hindsight), where a strict LLM judge scores the final answer – not just whether we found the right chunk.
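The distinction in miniature – a sketch where `retrieved_ids` comes from any retriever and `judge` stands in for AMB’s strict LLM judge (both names are mine, not the harness’s API):

```python
def recall_at_k(retrieved_ids: list[str], gold_id: str, k: int = 10) -> bool:
    """Retrieval metric: did the right chunk land anywhere in the top k?"""
    return gold_id in retrieved_ids[:k]

def judged_accuracy(answers: list[str], golds: list[str], judge) -> float:
    """End-to-end metric: `judge` scores the *final answer*, so a question
    can fail here even when recall_at_k succeeded for it."""
    verdicts = [judge(answer, gold) for answer, gold in zip(answers, golds)]
    return sum(verdicts) / len(verdicts)
```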

A 4.5B model on a laptop just read a wind turbine power curve

7 mins

Back in March, we tested whether Gemini Embedding 2 could survive MRL compression from 3072 to 512 dimensions on cross-modal retrieval. It did – a PNG power curve, a CSV of maintenance costs, and a text spec all mapped into the same 512-dimensional vector space. The embeddings worked.

But embeddings are only half the problem. Once you retrieve a memory that includes a PNG, something has to actually read it. Last time that was Gemini Flash – a cloud API. This time, it’s Gemma 4 running on my laptop.
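The MRL trick itself is almost embarrassingly small – a sketch of truncating a 3072-dim embedding to its leading 512 dimensions and re-normalizing. This assumes Matryoshka-trained embeddings, where the leading dimensions carry the coarse semantics:

```python
import numpy as np

def mrl_truncate(embedding: np.ndarray, dims: int = 512) -> np.ndarray:
    """Keep the first `dims` components of an MRL-trained embedding and
    re-normalize, so cosine similarity still behaves in the smaller space."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)     # stand-in for a 3072-dim embedding
small = mrl_truncate(full, 512)  # same vector family, 1/6 the storage
assert small.shape == (512,)
```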

BEAM benchmark – a fair look at where we stand on long-term memory

8 mins

A few weeks ago I ran BEAM – the long-term memory benchmark from Tavakoli et al. – against Ogham for the first time. The result was a retrieval-only number, R@10 = 0.689 on the 100K bucket, with a few categories sitting embarrassingly low.

This week I shipped v0.9.0 with a stack of context-engineering features (timeline tables, multilingual entity extraction across 18 languages, session boundary headers, preference detection, Lost-in-the-Middle reordering). Then I built a batch-API harness on top of OpenAI’s reasoning models so I could finally measure end-to-end QA accuracy, not just retrieval.
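Of that stack, Lost-in-the-Middle reordering is the easiest to show. A sketch of the general technique – models attend best to the start and end of context, so the strongest hits go to the edges and the weakest get buried in the middle (the standard trick, not necessarily Ogham’s exact implementation):

```python
def litm_reorder(docs_best_first: list[str]) -> list[str]:
    """Reorder relevance-ranked docs so the strongest sit at the start and
    end of the context window and the weakest land in the middle, where
    LLMs pay the least attention."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# ranks 1..5 -> [r1, r3, r5, r4, r2]: best at the edges, worst in the middle
print(litm_reorder(["r1", "r2", "r3", "r4", "r5"]))
```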

From 62% to 92% – what we learned about reading, not retrieval

5 mins

Your vector search found the right memories. Your LLM still got the answer wrong.

That was us last week. We run LongMemEval – 500 questions that test whether an AI can answer questions from its own conversation history. Retrieval was at 97.2% R@10. The memories were there. The LLM could only answer 62.4% of questions correctly.

Now we’re at 91.8%. Same memories. Same embeddings. No fine-tuning. The retrieval didn’t change at all. Everything that mattered happened between “found the memory” and “answered the question.”

Zero-cost retrieval upgrade: fixing our own fusion math

3 mins

We found a bug in our own search pipeline. Not a crash – more of a “this has been leaving performance on the table for weeks” kind of thing.

Our hybrid search combines two signals: dense vector similarity (Voyage embeddings via pgvector) and keyword matching (PostgreSQL tsvector full-text search). We called the fusion method “RRF” in our docs. It wasn’t.
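For reference, textbook Reciprocal Rank Fusion ignores raw similarity scores entirely and fuses on ranks alone – a minimal sketch, with k = 60 as the conventional constant:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Textbook RRF: score(d) = sum over lists of 1 / (k + rank_of_d),
    using 1-based ranks. Raw similarity scores never enter the formula."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]            # pgvector nearest-neighbour order
keyword = ["c", "a", "d"]          # tsvector full-text order
print(rrf_fuse([dense, keyword]))  # "a" wins: it ranks high in both lists
```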

What was wrong

The old fusion code did this:

One config flag, 7% better retrieval

3 mins

We added optional cross-encoder reranking to Ogham’s search pipeline. One environment variable, a 21MB model, and our BEAM benchmark scores went from 0.65 to 0.70 R@10.

Here’s what happened.

The gap

Our retrieval pipeline uses Voyage embeddings and hybrid search – dense vectors plus full-text search, fused with Reciprocal Rank Fusion. It gets the job done. 97.2% recall on LongMemEval, 0.65 R@10 on BEAM’s 400-question benchmark.
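The reranking step itself looks roughly like this – a sketch using sentence-transformers, where the model name is one small public cross-encoder standing in for whatever ~21MB model Ogham actually ships behind its flag:

```python
from sentence_transformers import CrossEncoder

# A tiny MS MARCO cross-encoder; a stand-in, not necessarily Ogham's model.
reranker = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Score each (query, candidate) pair jointly and re-sort. Slower than
    bi-encoder retrieval, so it runs only on the short fused list."""
    scores = reranker.predict([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```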

Giving Google SCION agents shared memory

5 mins

Steve Yegge’s Gas Town tackles state with git-backed hooks and a bead-tracking ledger. Google’s SCION isolates agents in Docker containers. Both solve orchestration. Neither has semantic memory that agents can search by meaning.

We tested SCION. It runs LLM agents in containers – a researcher, a coder, a reviewer, each in their own sandbox. It’s well designed, with one obvious gap.

The agents can’t talk to each other.

Each container is isolated. Agent A doesn’t know what Agent B learned. When an agent stops, everything it figured out stays locked in its container. Start a new agent for the next task and you’re back to zero.