Blog

Your AI logs show who used it. They don't show what it remembered.

7 mins

We spent a weekend running Ogham through someone else’s benchmark. Here’s what happened.

Last week we published that Ogham hits 99.5% Recall@10 on LongMemEval – the right memory chunk lands in the top 10 results for nearly every question. Good number. We were pleased with ourselves.

Then we ran the same 500 questions through the AMB benchmark harness, built by the Vectorize team (the people behind Hindsight), where a strict LLM judge scores the final answer – not just whether we found the right chunk.
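For readers unfamiliar with the metric, here is a minimal sketch of Recall@10 as described above: a question counts as a hit when at least one correct memory chunk appears in the top 10 results. The function names and the `(ranked_ids, relevant_ids)` data shape are assumptions for illustration, not Ogham's or LongMemEval's actual harness code.

```python
def hit_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any relevant chunk id appears in the top-k results, else 0.0."""
    top_k = ranked_ids[:k]
    return 1.0 if any(chunk_id in relevant_ids for chunk_id in top_k) else 0.0


def mean_recall_at_k(results, k=10):
    """Average the per-question hit rate over (ranked_ids, relevant_ids) pairs."""
    if not results:
        raise ValueError("results must contain at least one question")
    scores = [hit_at_k(ranked, relevant, k) for ranked, relevant in results]
    return sum(scores) / len(scores)


# Hypothetical example: two questions, one hit and one miss.
example = [
    (["c7", "c2", "c9"], {"c2"}),   # relevant chunk found at rank 2
    (["c1", "c4", "c5"], {"c8"}),   # relevant chunk never retrieved
]
print(mean_recall_at_k(example, k=10))
```

Note what this number cannot tell you: it only checks that the chunk was retrieved, which is why a judge that scores the final answer can disagree with it.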

A 4.5B model on a laptop just read a wind turbine power curve

7 mins

Back in March, we tested whether Gemini Embedding 2 could survive MRL compression from 3072 to 512 dimensions on cross-modal retrieval. It did – a PNG power curve, a CSV of maintenance costs, and a text spec all mapped into the same 512-dimensional vector space. The embeddings worked.
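As a rough illustration of what MRL (Matryoshka Representation Learning) compression means in practice: you keep the leading dimensions of the vector and re-normalize so cosine similarity still behaves. This is a sketch under that assumption, using plain Python lists; `truncate_embedding` is a hypothetical helper, not part of any embedding SDK or of Ogham.

```python
import math


def truncate_embedding(vec, dims=512):
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims > len(vec):
        raise ValueError("dims cannot exceed the original embedding size")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero vector: nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity; reduces to a dot product for unit-length inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Toy 3-dimensional vector standing in for a 3072-dimensional embedding.
full = [3.0, 4.0, 12.0]
small = truncate_embedding(full, dims=2)
print(small)
```

Every vector in the index and every query has to be truncated to the same size, otherwise the similarity scores are not comparable.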

But embeddings are only half the problem. Once you retrieve a memory that includes a PNG, something has to actually read it. Last time that was Gemini Flash – a cloud API. This time, it’s Gemma 4 running on my laptop.

BEAM benchmark - a fair look at where we stand on long-term memory

8 mins

A few weeks ago I ran BEAM – the long-term memory benchmark from Tavakoli et al. – against Ogham for the first time. The result was a retrieval-only number, R@10 = 0.689 on the 100K bucket, with a few categories sitting embarrassingly low.

This week I shipped v0.9.0 with a stack of context-engineering features (timeline tables, multilingual entity extraction across 18 languages, session boundary headers, preference detection, Lost-in-the-Middle reordering). Then I built a batch-API harness on top of OpenAI’s reasoning models so I could finally measure end-to-end QA accuracy, not just retrieval.

From 62% to 92% - what we learned about reading, not retrieval

5 mins

Your vector search found the right memories. Your LLM still got the answer wrong.

That was us last week. We run LongMemEval – 500 questions that test whether an AI can answer questions from its own conversation history. Retrieval was at 97.2% R@10. The memories were there. The LLM could only answer 62.4% of questions correctly.

Now we’re at 91.8%. Same memories. Same embeddings. No fine-tuning. The retrieval didn’t change at all. Everything that mattered happened between “found the memory” and “answered the question.”

Zero-cost retrieval upgrade: fixing our own fusion math

3 mins

We found a bug in our own search pipeline. Not a crash – more of a “this has been leaving performance on the table for weeks” kind of thing.

Our hybrid search combines two signals: dense vector similarity (Voyage embeddings via pgvector) and keyword matching (PostgreSQL tsvector full-text search). We called the fusion method “RRF” in our docs. It wasn’t.

What was wrong

The old fusion code did this:

One config flag, 7% better retrieval

3 mins

We added optional cross-encoder reranking to Ogham’s search pipeline. One environment variable, a 21MB model, and our BEAM benchmark scores went from 0.65 to 0.70 R@10.

Here’s what happened.

The gap

Our retrieval pipeline uses Voyage embeddings and hybrid search – dense vectors plus full-text search, fused with Reciprocal Rank Fusion. It gets the job done. 97.2% recall on LongMemEval, 0.65 R@10 on BEAM’s 400-question benchmark.
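For context, Reciprocal Rank Fusion merges ranked lists using only each item's position, not its raw score. The sketch below is a generic version of the technique, not Ogham's pipeline code; the function name, the example ids, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two signals.
dense_hits = ["a", "b", "c"]      # vector similarity order
keyword_hits = ["b", "d", "a"]    # full-text search order
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

Because only ranks are used, the dense and keyword scores never need to be put on a common scale, which is the main appeal of the method for hybrid search.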

Giving Google SCION agents shared memory

5 mins

Steve Yegge’s Gas Town tackles state with git-backed hooks and a bead-tracking ledger. Google’s SCION isolates agents in Docker containers. Both solve orchestration. Neither has semantic memory that agents can search by meaning.

We tested SCION. It runs LLM agents in containers – a researcher, a coder, a reviewer, each in their own sandbox. It’s well-designed, with one obvious gap.

The agents can’t talk to each other.

Each container is isolated. Agent A doesn’t know what Agent B learned. When an agent stops, everything it figured out stays locked in its container. Start a new agent for the next task and you’re back to zero.

The missing primitive: why AI agents need persistent memory

3 mins

Nate B Jones put out a video yesterday breaking down three tools Anthropic just shipped – Dispatch, Computer Use, and Scheduled Tasks. The tools are worth knowing about. But the more interesting thread is something Nate keeps circling back to between the demos.

He calls it “Open Brain.” A database you control. Cheap. Almost free. Where your AI stores what works and what doesn’t, and that knowledge compounds over time.

We’ve been building that. It’s called Ogham.

We ripped out litellm in one afternoon

3 mins

On Monday, litellm versions 1.82.7 and 1.82.8 hit PyPI with credential-stealing malware baked in. The attack payload collected SSH keys, .env files, cloud credentials, Kubernetes configs, and shell history, encrypted everything with RSA-4096, and sent it to a fake domain. Then it tried to plant persistent backdoors in your kube-system namespace and at ~/.config/sysmon/sysmon.py.

We had litellm in our API gateway. Version 1.82.4 – safe, but one careless pip install --upgrade away from compromise.

Claude now remembers things. Here's what it doesn't remember.

5 mins

Anthropic shipped three features recently that matter if you care about AI memory: auto-memory for session-to-session learning, auto-dream for background memory consolidation, and auto mode for more autonomous tool use.

Auto-memory landed in v2.1.59. Claude now writes notes to itself between sessions – build commands it discovered, debugging patterns, your code style preferences. These live as markdown files in ~/.claude/projects/<project>/memory/ and the first 200 lines get loaded at the start of every conversation.