Benchmarks
97.2% Recall@10 on LongMemEval #
Tested against the LongMemEval benchmark (Wu et al., ICLR 2025) – 500 questions across 5 memory abilities, 124,342 memories ingested. No LLM in the retrieval pipeline. One PostgreSQL query.
| Category | R@10 | MRR | Questions |
|---|---|---|---|
| single-session-assistant | 100% | 100% | 56 |
| knowledge-update | 100% | 97.4% | 78 |
| single-session-user | 98.6% | 89.8% | 70 |
| multi-session | 97.3% | 90.2% | 133 |
| single-session-preference | 96.7% | 87.5% | 30 |
| temporal-reasoning | 93.5% | 85.9% | 133 |
| Overall | 97.2% | 91.1% | 500 |
How other systems score #
Most memory systems report end-to-end QA accuracy (retrieval, then an LLM reads the retrieved context and answers the question). That measures the whole pipeline, not just retrieval.
| System | QA Accuracy | Architecture |
|---|---|---|
| OMEGA | 95.4% | Classification + extraction pipeline |
| Observational Memory (Mastra) | 94.9% | Observation extraction + GPT-5-mini |
| Hindsight (Vectorize) | 91.4% | 4 memory types + Gemini-3 |
| Zep (Graphiti) | 71.2% | Temporal knowledge graph + GPT-4o |
| Mem0 | 49.0% | RAG-based |
Ogham’s 97.2% is retrieval R@10 – whether the correct session appeared in the top 10 results – with no LLM interpreting the content. The LongMemEval paper reports 78.4% as its best retrieval baseline. Other retrieval systems that report similar R@10 numbers typically use cross-encoder reranking, NLI verification, knowledge graph enrichment, and LLM-as-a-judge pipelines. Ogham reaches 97.2% with one Postgres query.
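For clarity on how those two columns are scored, here is a minimal per-question sketch of Recall@10 and MRR (standard definitions; ranked_sessions and gold_session are illustrative names, not the benchmark script’s API):

```python
def score_question(ranked_sessions: list[str], gold_session: str, k: int = 10) -> tuple[float, float]:
    """Recall@k: 1 if the gold session appears in the top k results, else 0.
    MRR: reciprocal of the rank at which the gold session first appears (0 if absent)."""
    recall_at_k = 1.0 if gold_session in ranked_sessions[:k] else 0.0
    try:
        mrr = 1.0 / (ranked_sessions.index(gold_session) + 1)
    except ValueError:
        mrr = 0.0
    return recall_at_k, mrr
```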
What Ogham does differently #
- Hybrid search combining pgvector cosine similarity with tsvector keyword matching via CCF (Convex Combination Fusion) – see the sketch after this list
- Entity-centric bridge retrieval for multi-hop temporal queries (“how many months between X and Y”)
- Gaussian decay temporal re-ranking with directional penalty
- halfvec HNSW index (float16 compression, roughly half the size of float32)
- No neural rerankers, no knowledge graph enrichment, no query expansion, no LLM calls during search
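Two of those pieces are easy to show in miniature. Below is a hedged sketch of CCF fusion as a single SQL query and of Gaussian-decay temporal re-ranking – the table, columns, alpha, sigma, and penalty values are illustrative, not Ogham’s actual schema or query:

```python
import math
import psycopg

# Hedged sketch: assumes memories(id, content, embedding vector, content_tsv tsvector)
# and a pre-computed query embedding. CCF = alpha * vector similarity +
# (1 - alpha) * keyword rank (in practice the keyword rank would be normalised
# to [0, 1] before fusing).
HYBRID_SQL = """
WITH scored AS (
    SELECT id,
           content,
           1 - (embedding <=> %(qvec)s::vector)                           AS vec_score,
           ts_rank_cd(content_tsv, plainto_tsquery('english', %(qtext)s)) AS kw_score
    FROM memories
)
SELECT id, content, %(alpha)s * vec_score + (1 - %(alpha)s) * kw_score AS ccf_score
FROM scored
ORDER BY ccf_score DESC
LIMIT 10;
"""

def hybrid_search(conn: psycopg.Connection, query_text: str, query_embedding: list[float], alpha: float = 0.7):
    qvec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"  # pgvector literal
    return conn.execute(HYBRID_SQL, {"qvec": qvec, "qtext": query_text, "alpha": alpha}).fetchall()

def temporal_rerank(score: float, delta_days: float, sigma_days: float = 30.0,
                    wrong_side_penalty: float = 0.8) -> float:
    """Gaussian-decay re-ranking sketch: results far from the query's target date decay,
    and results on the wrong side of it (e.g. after an 'as of' cutoff) take an extra
    multiplicative penalty. The constants here are illustrative, not Ogham's."""
    decayed = score * math.exp(-(delta_days ** 2) / (2 * sigma_days ** 2))
    return decayed * wrong_side_penalty if delta_days < 0 else decayed
```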
Benchmark setup #
- Embedding: Voyage AI voyage-4-lite at 512 dimensions
- Index: halfvec HNSW (float16) – DDL sketched below
- Database: PostgreSQL 17 with pgvector 0.8.2
- Cache hit ratio during benchmark: 97.5%
- CPU usage: under 4%
- Zero errors across all 500 questions
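For reference, one way to get a float16 HNSW index over a 512-dim column – an expression index casting to halfvec – looks roughly like this. Table, column, and index names are illustrative, not Ogham’s actual schema:

```python
import psycopg

# halfvec requires pgvector >= 0.7. The expression index stores float16 copies of the
# vectors, roughly halving index size versus an HNSW index over vector(512). Queries
# must use the same halfvec cast (e.g. ORDER BY embedding::halfvec(512) <=>
# $1::halfvec(512)) for the planner to use this index.
with psycopg.connect("postgresql://...") as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS memories_embedding_hnsw "
        "ON memories USING hnsw ((embedding::halfvec(512)) halfvec_cosine_ops)"
    )
```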
Operational performance #
Latency, throughput, and provider comparisons across Supabase Cloud, Neon serverless, and a self-hosted PostgreSQL instance, tested on an M1 Mac and a low-power Intel NAS with local and remote embedding providers.
Before you read the tables #
The short version: self-hosting your database makes search fast. Hybrid search goes from ~110ms on Supabase Cloud to ~23ms on a local PostgreSQL instance. That 65-110ms we kept seeing on every machine? Network overhead. The database itself is quick.
A fanless Intel NAS imports 1000 memories in 33 seconds with OpenAI embeddings. Without a GPU, local embeddings take 3-4 seconds each – fine for one-at-a-time use, painful for bulk imports.
If you have a GPU, Ollama keeps everything local. If you don’t, using a cloud embedding provider (OpenAI, Mistral, or Voyage AI) is faster than local inference on both machines we tested.
Test platforms #
| Spec | Apple M1 (Mac) | Intel N305 (NAS) |
|---|---|---|
| CPU | 4P + 4E cores | 8 E-cores (Alder Lake-N) |
| GPU acceleration | Metal (Neural Engine) | None |
| Ollama inference | GPU | CPU |
| RAM | 16 GB unified | 4 GB DDR5 |
The N305 is a low-power NAS chip. Ollama has no GPU path for it – no CUDA, no ROCm, no Metal – so every embedding is pure CPU work.
Resource footprint #
| What | Size / latency |
|---|---|
| Docker image | 243 MB |
| Model download (once) | ~600 MB |
| Model in GPU memory (Metal) | <200 MB |
| Embedding, first call (M1) | ~500ms |
| Embedding, warm (M1) | 70-120ms |
On Apple Silicon, the model sits in GPU memory via Metal. Ollama unloads it after 5 minutes idle, so the next call takes ~500ms to reload.
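If that reload cost matters, Ollama accepts a keep_alive value per request (or an OLLAMA_KEEP_ALIVE environment variable) to hold the model in memory longer – a hedged sketch, with the model name and 30m duration as examples:

```python
import requests

# Assumes Ollama on its default local port. keep_alive tells Ollama how long to keep
# the model loaded after this request, avoiding the ~500ms reload on the next call.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "embeddinggemma", "input": "warm the model", "keep_alive": "30m"},
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embeddings"][0]))  # embedding dimension
```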
Baseline latency (10 memories, M1 + Ollama) #
Run with tests/bench.py:
| Operation | Mean | Median |
|---|---|---|
| Embedding (cold) | ~90 ms | ~85 ms |
| Embedding (cached) | <0.1 ms | <0.1 ms |
| Store memory | ~65 ms | ~64 ms |
| Vector search (DB only) | ~65 ms | ~64 ms |
| Hybrid search (DB only) | ~65 ms | ~64 ms |
| Auto-link (HNSW scan) | ~65 ms | ~65 ms |
| Explore graph (search + CTE) | ~69 ms | ~69 ms |
| Get related (CTE traversal) | ~65 ms | ~64 ms |
Search timings exclude embedding generation to show database latency on its own. Graph operations add almost nothing on top of base search – the recursive CTEs are essentially free at this scale.
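For a sense of what those graph operations do, here is a hedged sketch of a depth-capped recursive CTE over a link table (memory_links and its columns are illustrative names, not Ogham’s schema):

```python
import psycopg

RELATED_SQL = """
WITH RECURSIVE related AS (
    SELECT id, 0 AS depth
    FROM memories
    WHERE id = %(start_id)s
  UNION ALL
    SELECT l.target_id, r.depth + 1
    FROM memory_links l
    JOIN related r ON l.source_id = r.id
    WHERE r.depth < %(max_depth)s          -- cap traversal depth
)
SELECT DISTINCT m.id, m.content
FROM related r
JOIN memories m ON m.id = r.id
WHERE m.id <> %(start_id)s;
"""

def get_related(conn: psycopg.Connection, memory_id: str, max_depth: int = 2):
    # At ~1000 rows the whole link table sits in shared buffers, so the traversal
    # adds only a few milliseconds on top of the base search.
    return conn.execute(RELATED_SQL, {"start_id": memory_id, "max_depth": max_depth}).fetchall()
```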
Stress test (1000 memories, M1 + Ollama) #
Run with tests/bench_stress.py. Imports real conversation memories from a 4,588-memory export, then benchmarks operations at scale. The SQLite embedding cache is cleared first so the cold import measures true embedding latency.
| Phase | Result |
|---|---|
| Cold import (no dedup) | 58.6s total, 58.6 ms/memory |
| Re-import (dedup) | 2.7s total, 2.7 ms/memory – 1000/1000 skipped |
| Auto-link backfill | 7 memories linked in 4 batches, 1.5s total |
| Hybrid search | 117 ms mean, 109 ms median |
| Explore graph (search + CTE) | 110 ms mean, 112 ms median |
| Get related (CTE traversal) | 85 ms mean, 83 ms median |
At the default 0.85 similarity threshold, auto-linking is conservative – only 7 out of 1000 conversational memories were similar enough to link. Dropping to 0.7 found 52 links across 4,242 memories in production. The threshold is configurable per call.
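Conceptually the auto-link pass is just a similarity sweep with a cutoff. A brute-force pure-Python sketch of that thresholding (in Ogham the comparison runs inside Postgres via the HNSW index; the function name and defaults here are illustrative):

```python
import numpy as np

def find_auto_links(ids: list[str], embeddings: np.ndarray, threshold: float = 0.85):
    """Return (id_a, id_b, similarity) for every pair whose cosine similarity clears
    the threshold. Lower the threshold (e.g. 0.7) to link more aggressively."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    links = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sims[i, j] >= threshold:
                links.append((ids[i], ids[j], float(sims[i, j])))
    return links
```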
Cross-platform comparison #
We tested four configurations to separate hardware performance from network latency.
Embedding throughput (cold, per memory) #
| Config | 10 memories | 20 memories | 50 memories | 1000 memories |
|---|---|---|---|---|
| M1 + OpenAI | – | – | – | 24.1 ms |
| N305 + OpenAI | – | – | – | 33.2 ms |
| M1 + Ollama | ~58 ms | ~58 ms | ~58 ms | 58.6 ms |
| N305 + Ollama | 3,160 ms | 2,827 ms | 4,380 ms | – |
M1 + Ollama stays consistent regardless of batch size. The N305 running Ollama gets worse the longer it runs – likely thermal throttling on those fanless E-cores. At 50 memories, the cold import takes 3m 39s compared to about 3 seconds on the M1. We gave up at 50 on N305 + Ollama. Extrapolating from the 4.4s/memory average, 1000 memories would have taken over an hour.
OpenAI comes in at 24ms/memory on the M1 and 33ms on the N305 – faster than Ollama on either machine. Not shocking given that OpenAI’s inference servers have considerably more compute than a 2020 Mac, but worth confirming: a network round-trip to their API is still quicker than running the model locally.
Stress test with OpenAI (1000 memories) #
| Phase | M1 + OpenAI | N305 + OpenAI |
|---|---|---|
| Cold import (no dedup) | 24.1s, 24.1 ms/memory | 33.2s, 33.2 ms/memory |
| Re-import (dedup) | 4.6s, 4.6 ms/memory | 4.4s, 4.4 ms/memory |
| Auto-link backfill | 8 links, 2 batches, 1.9s | 11 links, 2 batches, 0.5s |
| Hybrid search | 104 ms mean | 113 ms mean |
| Explore graph | 103 ms mean | 79 ms mean |
| Get related | 83 ms mean | 68 ms mean |
Search and graph operations #
| Operation | M1 + Ollama | N305 + Ollama (50) | M1 + OpenAI | N305 + OpenAI |
|---|---|---|---|---|
| Hybrid search | 117 ms | 106 ms | 104 ms | 113 ms |
| Explore graph | 110 ms | 97 ms | 103 ms | 79 ms |
| Get related | 85 ms | 65 ms | 83 ms | 68 ms |
All runs are 1000 memories except N305 + Ollama (50, too slow to go higher). These numbers all land in the same 65-117ms range regardless of configuration. That’s Supabase network latency, not hardware.
Self-hosted PostgreSQL (M1 + Ollama, 1000 memories) #
Same stress test, but against a self-hosted PostgreSQL + pgvector instance on the local network – PostgREST running in an LXC container, no Supabase Cloud in the loop.
| Phase | Supabase Cloud | Self-hosted | Speedup |
|---|---|---|---|
| Cold import (no dedup) | 58.6 ms/memory | 44.1 ms/memory | 1.3x |
| Re-import (dedup) | 2.7 ms/memory | 1.2 ms/memory | 2.3x |
| Auto-link backfill | 7 links, 1.5s | 22 links, 1.1s | – |
| Hybrid search | 117 ms mean | 23 ms mean | 5.1x |
| Explore graph | 110 ms mean | 16 ms mean | 6.9x |
| Get related | 85 ms mean | 10 ms mean | 8.5x |
That’s the number that jumped out. What we thought was database latency was almost entirely network round-trips. The actual HNSW scan plus recursive CTE traversal? 5-23ms.
The self-hosted run found more auto-links (22 vs 7) due to PostgREST schema cache timing differences during the run – not meaningful, just a side effect of the test setup.
PostgreSQL 17 with pgvector in an LXC container, PostgREST as the API layer, M1 Mac running Ollama for embeddings. The BYODB guide walks through the setup.
Neon serverless (M1 + OpenAI, 1000 memories) #
Same stress test against Neon serverless Postgres (EU-Central-1, Frankfurt). Ogham connects via psycopg through Neon’s connection pooler – no PostgREST, no Supabase client library in the path.
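The connection path is ordinary psycopg against Neon’s pooled endpoint – roughly the following, with a placeholder DSN (pooled Neon hosts carry a -pooler suffix and require TLS):

```python
import psycopg

# Placeholder DSN – substitute your own Neon pooled connection string.
# No PostgREST or Supabase client library sits between Ogham and Postgres here.
dsn = "postgresql://user:password@ep-example-pooler.eu-central-1.aws.neon.tech/ogham?sslmode=require"

with psycopg.connect(dsn) as conn:
    print(conn.execute("SELECT version()").fetchone()[0])
```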
We ran this as a destruction test: dropped the entire schema, re-applied schema_postgres.sql from scratch, then ran the full benchmark on an empty database.
| Phase | Supabase Cloud | Neon (Frankfurt) | Difference |
|---|---|---|---|
| Cold import (no dedup) | 24.1 ms/memory | 37.8 ms/memory | Neon slower (embedding cache state) |
| Re-import (dedup) | 4.6 ms/memory | 3.7 ms/memory | 1.2x faster |
| Auto-link backfill | 8 links, 1.9s | 16 links, 1.5s | – |
| Hybrid search | 104 ms mean | 105 ms mean | Same |
| Explore graph | 103 ms mean | 80 ms mean | 1.3x faster |
| Get related | 83 ms mean | 71 ms mean | 1.2x faster |
The cold import number is higher because the embedding cache was in a different state between runs – import speed is dominated by embedding generation, not database writes. Search and graph queries are the more useful comparison, and there Neon was comparable to or faster than Supabase Cloud.
Graph operations improved the most. explore_knowledge dropped from 103ms to 80ms, and get_related from 83ms to 71ms. Both skip the PostgREST serialization layer and go straight to Postgres, so the recursive CTE results come back faster.
Two things explain the difference. Neon’s Frankfurt region is closer to our test machine than Supabase’s AWS Ireland. And the direct psycopg connection cuts out PostgREST – one fewer hop between the application and the database.
Self-hosted PostgreSQL on the local network is still faster at 5-23ms. Neon lands between Supabase Cloud and self-hosted.
Embedding provider comparison (Neon, M1, 500 memories) #
Same stress test, all four embedding providers, all against Neon. Cloud providers hit their APIs over the network; Ollama ran locally on the M1 with Metal acceleration. Each run wipes everything and starts from scratch – fresh data, cleared embedding cache, 500 memories imported cold.
Mistral’s mistral-embed is locked at 1024 dimensions. Voyage’s voyage-4-lite gives you a few fixed sizes (256, 512, 1024, 2048). OpenAI takes anything up to 1536. Ollama’s embeddinggemma supports 128, 256, 512, and 768 via MRL truncation. We ran 512 and 1024 where we could.
| Metric | Ollama 512 | OpenAI 512 | OpenAI 1024 | Mistral 1024 | Voyage 512 | Voyage 1024 |
|---|---|---|---|---|---|---|
| Cold import | 85.0 ms/mem | 36.3 ms/mem | 39.6 ms/mem | 50.8 ms/mem | 36.6 ms/mem | 40.1 ms/mem |
| Dedup re-import | 2.2 ms/mem | 2.9 ms/mem | 5.2 ms/mem | 5.5 ms/mem | 2.7 ms/mem | 5.9 ms/mem |
| Auto-links (0.85) | 29 | 16 | 13 | 468 | 62 | 63 |
| Hybrid search | 92.4 ms | 97.1 ms | 109.4 ms | 110.0 ms | 95.4 ms | 117.1 ms |
| Explore graph | 85.4 ms | 79.8 ms | 81.3 ms | 101.5 ms | 80.6 ms | 91.1 ms |
| Get related | 71.9 ms | 71.3 ms | 75.8 ms | 71.7 ms | 72.4 ms | 71.3 ms |
Import speed #
OpenAI and Voyage at 512 dims are essentially tied – 36ms per memory. Mistral trails at 50.8ms because it caps you at about 32 texts per request (16K token limit), so you’re making 15x more API calls than OpenAI or Voyage, which take 500+ texts in one go. Ollama at 85ms per memory is the slowest – embeddings run locally, not on a cloud GPU cluster. This was on an M1 Mac with Metal acceleration; expect slower on CPU-only hardware. No API cost though, and nothing leaves your machine.
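The gap mostly comes down to how many round-trips the batch limit forces. A sketch of the chunking arithmetic (embed_batch stands in for whichever provider API call you use):

```python
def embed_all(texts: list[str], embed_batch, batch_size: int) -> list[list[float]]:
    """Chunk texts to the provider's batch limit and embed chunk by chunk.
    1000 texts at Mistral's ~32-per-request cap is ~32 API calls; the same import
    at 500 per call (OpenAI, Voyage) is 2 calls, so far less network overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))  # one API call per chunk
    return vectors
```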
Search latency #
Halving the dimensions from 1024 to 512 cuts 15-20% off search times. Ollama’s 92ms hybrid search edged out the cloud providers, but Voyage at 95ms and OpenAI at 97ms are close enough that it doesn’t matter in practice. At 1024, all three cloud providers sit in the 109-117ms range – you’re measuring Neon round-trips at that point, not vector math.
Auto-linking behaviour #
This was the surprise. The same 0.85 cosine similarity threshold produces wildly different results depending on the provider:
- Mistral found 468 links. Its vectors pack related content close together – the embedding space is tight.
- Voyage found 62-63 links, same at both 512 and 1024 dims.
- Ollama landed at 29 links – moderate spread, sitting between Voyage and OpenAI.
- OpenAI found 13-16 links. Widest spread of the four, which lines up with what we saw switching from Ollama to OpenAI – a threshold that worked for one provider found almost nothing with another.
If you want a similar density of auto-links across providers, tune the threshold: around 0.65-0.70 for OpenAI, 0.75-0.80 for Ollama, 0.70-0.75 for Voyage, and 0.85 works as-is for Mistral. It’s configurable per call.
Provider specs #
| Provider | Model | Batch limit | Dimensions |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 2,048 inputs | Any up to 1536 |
| Mistral | mistral-embed | ~32 inputs (16K tokens) | 1024 only |
| Voyage AI | voyage-4-lite | 1,000 inputs | 256, 512, 1024, 2048 |
| Ollama | embeddinggemma (default) | No API limit | 128, 256, 512, 768 (default 512) |
embeddinggemma supports Matryoshka (MRL) truncation at 128, 256, 512, and 768 dimensions. 512 is the default – best trade-off between speed and quality. Set EMBEDDING_DIM=768 if you want the full native size. Other Ollama models have different dimensions – mxbai-embed-large does 1024, for instance. Set OLLAMA_EMBED_MODEL to switch.
Watch out for Mistral: mistral-embed will flat-out reject the output_dimension parameter. If you want smaller vectors, use OpenAI or Voyage instead.
Dimension reduction (OpenAI 512-dim vs 768-dim, M1) #
OpenAI’s text-embedding-3-small supports Matryoshka truncation – you ask for fewer dimensions and it truncates server-side. We wanted to know if 512 dimensions would hurt search quality, so we tested it against our 768-dim baseline on the same 4,300-memory production database.
Changing dimensions is more than an .env edit. pgvector locks the dimension into the column type, so you need a migration. We ship a generator script that handles the SQL for you.
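Done by hand, the migration amounts to dropping the index, retyping the column, and rebuilding before re-embedding – a hedged sketch with illustrative table, column, and index names (the shipped generator script emits the real SQL for your schema):

```python
import psycopg

with psycopg.connect("postgresql://...") as conn:
    # Old vectors can't be reused at a new dimension, so the column is retyped empty
    # and every memory is re-embedded afterwards (re_embed_all).
    conn.execute("DROP INDEX IF EXISTS memories_embedding_hnsw")
    conn.execute(
        "ALTER TABLE memories "
        "ALTER COLUMN embedding TYPE vector(512) USING NULL::vector(512)"
    )
    conn.execute(
        "CREATE INDEX memories_embedding_hnsw "
        "ON memories USING hnsw ((embedding::halfvec(512)) halfvec_cosine_ops)"
    )
```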
Re-embed performance #
| Dimension | Memories | Time | Batch size |
|---|---|---|---|
| 768 | 4,300 | <1 min | 500/call |
| 512 | 4,300 | <1 min | 500/call |
Both finish in under a minute. re_embed_all batches 500 texts per OpenAI API call. An earlier version sent one text per call – re-embedding at 1024 dimensions took over an hour before we fixed that.
Stress test (1000 memories, M1 + OpenAI, 512-dim) #
| Phase | 768-dim | 512-dim |
|---|---|---|
| Cold import (no dedup) | 24.1 ms/memory | 12.5 ms/memory |
| Re-import (dedup) | 4.6 ms/memory | 4.4 ms/memory |
| Auto-link backfill | 8 links, 1.9s | 9 links, 0.6s |
| Hybrid search | 104 ms mean | 132 ms mean |
| Explore graph | 103 ms mean | 108 ms mean |
| Get related | 83 ms mean | 79 ms mean |
Cold imports run about 2x faster at 512 – less data per vector, faster HNSW indexing. Search and graph latency stayed within the usual network-dominated range; those numbers are still round-trips to Supabase Cloud, not vector math.
Similarity scores land a bit higher at 512 (top results at 0.55-0.77 vs 0.5-0.6 at 768). Vectors packed into fewer dimensions cluster tighter. DEFAULT_MATCH_THRESHOLD=0.35 still works without adjustment.
What we learned #
Two things matter: where your embeddings come from, and where your database lives.
For imports, embeddings are the bottleneck. All three cloud providers beat local Ollama on both machines. OpenAI and Voyage at 512 dimensions come in at ~36ms/memory – nearly identical. Mistral is slower because of its small batch limit, but it finds far more auto-links than the others if that matters to you.
For search and graph queries, the database location matters more than we expected. Moving from Supabase Cloud to self-hosted PostgreSQL cut search latency by 5-8x. Neon lands between the two – graph queries dropped to 70-80ms by cutting out the PostgREST layer, and there’s no server to run. If your AI clients are recalling memories in conversation, the difference is tangible – 16ms feels instant, 110ms adds a beat of lag to every tool call.
Ollama on the N305 works, but at 3-4 seconds per embedding it’s only practical for one-at-a-time use. For bulk imports, pick a cloud provider – any of the three will do. See switching providers.
The N305 also thermal-throttled. Per-memory cost climbed from 2.8s at 20 memories to 4.4s at 50, a 57% increase from heat buildup. Worth knowing if you’re running sustained inference on fanless hardware.
A lighter Ollama model (nomic-embed-text, or all-minilm at 384 dimensions) would run faster on CPU, but you’d need to re-embed everything and resize the pgvector column. Depends on how much keeping things local matters to you.
The M1’s 16 GB unified memory is shared between everything. A machine with 32 GB, or a dedicated GPU with its own VRAM, would give Ollama more room and likely close some of the gap with OpenAI.
Running benchmarks #
Operational benchmarks #
uv run python tests/bench.py # measure embedding, store, search latency
uv run python tests/bench.py --json # machine-readable output
uv run python tests/bench_stress.py --count 1000 # stress test with 1000 memories
LongMemEval retrieval benchmark #
Warning: The full 500-question benchmark ingests 124K+ memories (~1.2GB). Don’t run this against a free-tier database – use a local Docker Postgres (docker run -d --name ogham-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 pgvector/pgvector:pg17) or a paid instance.
# Download the dataset
python benchmarks/longmemeval_benchmark.py --download
# Run all 500 questions
python benchmarks/longmemeval_benchmark.py --top-k 10
# Run a single category for fast iteration (~13 minutes)
python benchmarks/longmemeval_benchmark.py --question-type temporal-reasoning
You’ll need a database (Supabase, Neon, or self-hosted PostgreSQL) and an embedding provider – Ollama for local, or an API key for OpenAI, Mistral, or Voyage.