Benchmarks
97.2% Recall@10 on LongMemEval #
Tested against the LongMemEval benchmark (Wu et al., ICLR 2025) – 500 questions across 5 memory abilities, 124,342 memories ingested. No LLM in the retrieval pipeline. One PostgreSQL query.
| Category | R@10 | MRR | Questions |
|---|---|---|---|
| single-session-assistant | 100% | 100% | 56 |
| knowledge-update | 100% | 97.4% | 78 |
| single-session-user | 98.6% | 89.8% | 70 |
| multi-session | 97.3% | 90.2% | 133 |
| single-session-preference | 96.7% | 87.5% | 30 |
| temporal-reasoning | 93.5% | 85.9% | 133 |
| Overall | 97.2% | 91.1% | 500 |
How other systems score #
Most memory systems report end-to-end QA accuracy (retrieval, then an LLM reads the retrieved context and answers the question). That measures the whole pipeline, not just retrieval.
| System | QA Accuracy | Architecture |
|---|---|---|
| OMEGA | 95.4% | Classification + extraction pipeline |
| Observational Memory (Mastra) | 94.9% | Observation extraction + GPT-5-mini |
| Hindsight (Vectorize) | 91.4% | 4 memory types + Gemini-3 |
| Zep (Graphiti) | 71.2% | Temporal knowledge graph + GPT-4o |
| Mem0 | 49.0% | RAG-based |
Ogham’s 97.2% is retrieval R@10 – whether the correct session appeared in the top 10 results – with no LLM interpreting the content. The LongMemEval paper reports 78.4% as its best retrieval baseline. Other retrieval systems that report similar R@10 numbers typically use cross-encoder reranking, NLI verification, knowledge graph enrichment, and LLM-as-a-judge pipelines. Ogham reaches 97.2% with one Postgres query.
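For clarity on how those two columns are scored, here is a minimal per-question sketch of Recall@10 and MRR (standard definitions; ranked_sessions and gold_session are illustrative names, not the benchmark script’s API):

```python
def score_question(ranked_sessions: list[str], gold_session: str, k: int = 10) -> tuple[float, float]:
    """Recall@k: 1 if the gold session appears in the top k results, else 0.
    MRR: reciprocal of the rank at which the gold session first appears (0 if absent)."""
    recall_at_k = 1.0 if gold_session in ranked_sessions[:k] else 0.0
    try:
        mrr = 1.0 / (ranked_sessions.index(gold_session) + 1)
    except ValueError:
        mrr = 0.0
    return recall_at_k, mrr
```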
What Ogham does differently #
- Hybrid search combining pgvector cosine similarity with tsvector keyword matching via CCF (Convex Combination Fusion) – see the sketch after this list
- Entity-centric bridge retrieval for multi-hop temporal queries (“how many months between X and Y”)
- Gaussian decay temporal re-ranking with directional penalty
- halfvec HNSW index (float16 compression, roughly half the size of float32)
- No neural rerankers, no knowledge graph enrichment, no query expansion, no LLM calls during search
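Two of those pieces are easy to show in miniature. Below is a hedged sketch of CCF fusion as a single SQL query and of Gaussian-decay temporal re-ranking – the table, columns, alpha, sigma, and penalty values are illustrative, not Ogham’s actual schema or query:

```python
import math
import psycopg

# Hedged sketch: assumes memories(id, content, embedding vector, content_tsv tsvector)
# and a pre-computed query embedding. CCF = alpha * vector similarity +
# (1 - alpha) * keyword rank (in practice the keyword rank would be normalised
# to [0, 1] before fusing).
HYBRID_SQL = """
WITH scored AS (
    SELECT id,
           content,
           1 - (embedding <=> %(qvec)s::vector)                           AS vec_score,
           ts_rank_cd(content_tsv, plainto_tsquery('english', %(qtext)s)) AS kw_score
    FROM memories
)
SELECT id, content, %(alpha)s * vec_score + (1 - %(alpha)s) * kw_score AS ccf_score
FROM scored
ORDER BY ccf_score DESC
LIMIT 10;
"""

def hybrid_search(conn: psycopg.Connection, query_text: str, query_embedding: list[float], alpha: float = 0.7):
    qvec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"  # pgvector literal
    return conn.execute(HYBRID_SQL, {"qvec": qvec, "qtext": query_text, "alpha": alpha}).fetchall()

def temporal_rerank(score: float, delta_days: float, sigma_days: float = 30.0,
                    wrong_side_penalty: float = 0.8) -> float:
    """Gaussian-decay re-ranking sketch: results far from the query's target date decay,
    and results on the wrong side of it (e.g. after an 'as of' cutoff) take an extra
    multiplicative penalty. The constants here are illustrative, not Ogham's."""
    decayed = score * math.exp(-(delta_days ** 2) / (2 * sigma_days ** 2))
    return decayed * wrong_side_penalty if delta_days < 0 else decayed
```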
Benchmark setup #
- Embedding: Voyage AI voyage-4-lite at 512 dimensions
- Index: halfvec HNSW (float16) – DDL sketched below
- Database: PostgreSQL 17 with pgvector 0.8.2
- Cache hit ratio during benchmark: 97.5%
- CPU usage: under 4%
- Zero errors across all 500 questions
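For reference, one way to get a float16 HNSW index over a 512-dim column – an expression index casting to halfvec – looks roughly like this. Table, column, and index names are illustrative, not Ogham’s actual schema:

```python
import psycopg

# halfvec requires pgvector >= 0.7. The expression index stores float16 copies of the
# vectors, roughly halving index size versus an HNSW index over vector(512). Queries
# must use the same halfvec cast (e.g. ORDER BY embedding::halfvec(512) <=>
# $1::halfvec(512)) for the planner to use this index.
with psycopg.connect("postgresql://...") as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS memories_embedding_hnsw "
        "ON memories USING hnsw ((embedding::halfvec(512)) halfvec_cosine_ops)"
    )
```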
Operational performance #
Latency, throughput, and provider comparisons across Supabase Cloud, Neon serverless, and a self-hosted PostgreSQL instance, tested on an M1 Mac and a low-power Intel NAS with local and remote embedding providers.
Before you read the tables #
The short version: self-hosting your database makes search fast. Hybrid search goes from ~110ms on Supabase Cloud to ~23ms on a local PostgreSQL instance. That 65-110ms we kept seeing on every machine? Network overhead. The database itself is quick.
A fanless Intel NAS imports 1000 memories in 33 seconds with OpenAI embeddings. Without a GPU, local embeddings take 3-4 seconds each – fine for one-at-a-time use, painful for bulk imports.
If you have a GPU, Ollama keeps everything local. If you don’t, using a cloud embedding provider (OpenAI, Mistral, or Voyage AI) is faster than local inference on both machines we tested.
Test platforms #
| Spec | Apple M1 (Mac) | Intel N305 (NAS) |
|---|---|---|
| CPU | 4P + 4E cores | 8 E-cores (Alder Lake-N) |
| GPU acceleration | Metal (Neural Engine) | None |
| Ollama inference | GPU | CPU |
| RAM | 16 GB unified | 4 GB DDR5 |
The N305 is a low-power NAS chip. Ollama has no GPU path for it – no CUDA, no ROCm, no Metal – so every embedding is pure CPU work.
Resource footprint #
| What | Size / latency |
|---|---|
| Docker image | 243 MB |
| Model download (once) | ~600 MB |
| Model in GPU memory (Metal) | <200 MB |
| Embedding, first call (M1) | ~500ms |
| Embedding, warm (M1) | 70-120ms |
On Apple Silicon, the model sits in GPU memory via Metal. Ollama unloads it after 5 minutes idle, so the next call takes ~500ms to reload.
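If that reload cost matters, Ollama accepts a keep_alive value per request (or an OLLAMA_KEEP_ALIVE environment variable) to hold the model in memory longer – a hedged sketch, with the model name and 30m duration as examples:

```python
import requests

# Assumes Ollama on its default local port. keep_alive tells Ollama how long to keep
# the model loaded after this request, avoiding the ~500ms reload on the next call.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "embeddinggemma", "input": "warm the model", "keep_alive": "30m"},
    timeout=60,
)
resp.raise_for_status()
print(len(resp.json()["embeddings"][0]))  # embedding dimension
```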
Baseline latency (10 memories, M1 + Ollama) #
Run with tests/bench.py:
| Operation | Mean | Median |
|---|---|---|
| Embedding (cold) | ~90 ms | ~85 ms |
| Embedding (cached) | <0.1 ms | <0.1 ms |
| Store memory | ~65 ms | ~64 ms |
| Vector search (DB only) | ~65 ms | ~64 ms |
| Hybrid search (DB only) | ~65 ms | ~64 ms |
| Auto-link (HNSW scan) | ~65 ms | ~65 ms |
| Explore graph (search + CTE) | ~69 ms | ~69 ms |
| Get related (CTE traversal) | ~65 ms | ~64 ms |
Search timings exclude embedding generation to show database latency on its own. Graph operations add almost nothing on top of base search – the recursive CTEs are essentially free at this scale.
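For a sense of what those graph operations do, here is a hedged sketch of a depth-capped recursive CTE over a link table (memory_links and its columns are illustrative names, not Ogham’s schema):

```python
import psycopg

RELATED_SQL = """
WITH RECURSIVE related AS (
    SELECT id, 0 AS depth
    FROM memories
    WHERE id = %(start_id)s
  UNION ALL
    SELECT l.target_id, r.depth + 1
    FROM memory_links l
    JOIN related r ON l.source_id = r.id
    WHERE r.depth < %(max_depth)s          -- cap traversal depth
)
SELECT DISTINCT m.id, m.content
FROM related r
JOIN memories m ON m.id = r.id
WHERE m.id <> %(start_id)s;
"""

def get_related(conn: psycopg.Connection, memory_id: str, max_depth: int = 2):
    # At ~1000 rows the whole link table sits in shared buffers, so the traversal
    # adds only a few milliseconds on top of the base search.
    return conn.execute(RELATED_SQL, {"start_id": memory_id, "max_depth": max_depth}).fetchall()
```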
Stress test (1000 memories, M1 + Ollama) #
Run with tests/bench_stress.py. Imports real conversation memories from a 4,588-memory export, then benchmarks operations at scale. The SQLite embedding cache is cleared first so the cold import measures true embedding latency.
| Phase | Result |
|---|---|
| Cold import (no dedup) | 58.6s total, 58.6 ms/memory |
| Re-import (dedup) | 2.7s total, 2.7 ms/memory – 1000/1000 skipped |
| Auto-link backfill | 7 memories linked in 4 batches, 1.5s total |
| Hybrid search | 117 ms mean, 109 ms median |
| Explore graph (search + CTE) | 110 ms mean, 112 ms median |
| Get related (CTE traversal) | 85 ms mean, 83 ms median |
At the default 0.85 similarity threshold, auto-linking is conservative – only 7 out of 1000 conversational memories were similar enough to link. Dropping to 0.7 found 52 links across 4,242 memories in production. The threshold is configurable per call.
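Conceptually the auto-link pass is just a similarity sweep with a cutoff. A brute-force pure-Python sketch of that thresholding (in Ogham the comparison runs inside Postgres via the HNSW index; the function name and defaults here are illustrative):

```python
import numpy as np

def find_auto_links(ids: list[str], embeddings: np.ndarray, threshold: float = 0.85):
    """Return (id_a, id_b, similarity) for every pair whose cosine similarity clears
    the threshold. Lower the threshold (e.g. 0.7) to link more aggressively."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    links = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sims[i, j] >= threshold:
                links.append((ids[i], ids[j], float(sims[i, j])))
    return links
```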
Cross-platform comparison #
We tested four configurations to separate hardware performance from network latency.
Embedding throughput (cold, per memory) #
| Config | 10 memories | 20 memories | 50 memories | 1000 memories |
|---|---|---|---|---|
| M1 + OpenAI | – | – | – | 24.1 ms |
| N305 + OpenAI | – | – | – | 33.2 ms |
| M1 + Ollama | ~58 ms | ~58 ms | ~58 ms | 58.6 ms |
| N305 + Ollama | 3,160 ms | 2,827 ms | 4,380 ms | – |
M1 + Ollama stays consistent regardless of batch size. The N305 running Ollama gets worse the longer it runs – likely thermal throttling on those fanless E-cores. At 50 memories, the cold import takes 3m 39s compared to about 3 seconds on the M1. We gave up at 50 on N305 + Ollama. Extrapolating from the 4.4s/memory average, 1000 memories would have taken over an hour.
OpenAI comes in at 24ms/memory on the M1 and 33ms on the N305 – faster than Ollama on either machine. Not shocking given that OpenAI’s inference servers have considerably more compute than a 2020 Mac, but worth confirming: a network round-trip to their API is still quicker than running the model locally.
Stress test with OpenAI (1000 memories) #
| Phase | M1 + OpenAI | N305 + OpenAI |
|---|---|---|
| Cold import (no dedup) | 24.1s, 24.1 ms/memory | 33.2s, 33.2 ms/memory |
| Re-import (dedup) | 4.6s, 4.6 ms/memory | 4.4s, 4.4 ms/memory |
| Auto-link backfill | 8 links, 2 batches, 1.9s | 11 links, 2 batches, 0.5s |
| Hybrid search | 104 ms mean | 113 ms mean |
| Explore graph | 103 ms mean | 79 ms mean |
| Get related | 83 ms mean | 68 ms mean |
Search and graph operations #
| Operation | M1 + Ollama | N305 + Ollama (50) | M1 + OpenAI | N305 + OpenAI |
|---|---|---|---|---|
| Hybrid search | 117 ms | 106 ms | 104 ms | 113 ms |
| Explore graph | 110 ms | 97 ms | 103 ms | 79 ms |
| Get related | 85 ms | 65 ms | 83 ms | 68 ms |
All runs are 1000 memories except N305 + Ollama (50, too slow to go higher). These numbers all land in the same 65-117ms range regardless of configuration. That’s Supabase network latency, not hardware.
Self-hosted PostgreSQL (M1 + Ollama, 1000 memories) #
Same stress test, but against a self-hosted PostgreSQL + pgvector instance on the local network – PostgREST running in an LXC container, no Supabase Cloud in the loop.
| Phase | Supabase Cloud | Self-hosted | Speedup |
|---|---|---|---|
| Cold import (no dedup) | 58.6 ms/memory | 44.1 ms/memory | 1.3x |
| Re-import (dedup) | 2.7 ms/memory | 1.2 ms/memory | 2.3x |
| Auto-link backfill | 7 links, 1.5s | 22 links, 1.1s | – |
| Hybrid search | 117 ms mean | 23 ms mean | 5.1x |
| Explore graph | 110 ms mean | 16 ms mean | 6.9x |
| Get related | 85 ms mean | 10 ms mean | 8.5x |
That’s the number that jumped out. What we thought was database latency was almost entirely network round-trips. The actual HNSW scan plus recursive CTE traversal? 5-23ms.
The self-hosted run found more auto-links (22 vs 7) due to PostgREST schema cache timing differences during the run – not meaningful, just a side effect of the test setup.
PostgreSQL 17 with pgvector in an LXC container, PostgREST as the API layer, M1 Mac running Ollama for embeddings. The BYODB guide walks through the setup.
Neon serverless (M1 + OpenAI, 1000 memories) #
Same stress test against Neon serverless Postgres (EU-Central-1, Frankfurt). Ogham connects via psycopg through Neon’s connection pooler – no PostgREST, no Supabase client library in the path.
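The connection path is ordinary psycopg against Neon’s pooled endpoint – roughly the following, with a placeholder DSN (pooled Neon hosts carry a -pooler suffix and require TLS):

```python
import psycopg

# Placeholder DSN – substitute your own Neon pooled connection string.
# No PostgREST or Supabase client library sits between Ogham and Postgres here.
dsn = "postgresql://user:password@ep-example-pooler.eu-central-1.aws.neon.tech/ogham?sslmode=require"

with psycopg.connect(dsn) as conn:
    print(conn.execute("SELECT version()").fetchone()[0])
```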
We ran this as a destruction test: dropped the entire schema, re-applied schema_postgres.sql from scratch, then ran the full benchmark on an empty database.
| Phase | Supabase Cloud | Neon (Frankfurt) | Difference |
|---|---|---|---|
| Cold import (no dedup) | 24.1 ms/memory | 37.8 ms/memory | Neon slower (embedding cache state) |
| Re-import (dedup) | 4.6 ms/memory | 3.7 ms/memory | 1.2x faster |
| Auto-link backfill | 8 links, 1.9s | 16 links, 1.5s | – |
| Hybrid search | 104 ms mean | 105 ms mean | Same |
| Explore graph | 103 ms mean | 80 ms mean | 1.3x faster |
| Get related | 83 ms mean | 71 ms mean | 1.2x faster |
The cold import number is higher because the embedding cache was in a different state between runs – import speed is dominated by embedding generation, not database writes. Search and graph queries are the more useful comparison, and there Neon was comparable to or faster than Supabase Cloud.
Graph operations improved the most. explore_knowledge dropped from 103ms to 80ms, and get_related from 83ms to 71ms. Both skip the PostgREST serialization layer and go straight to Postgres, so the recursive CTE results come back faster.
Two things explain the difference. Neon’s Frankfurt region is closer to our test machine than Supabase’s AWS Ireland. And the direct psycopg connection cuts out PostgREST – one fewer hop between the application and the database.
Self-hosted PostgreSQL on the local network is still faster at 5-23ms. Neon lands between Supabase Cloud and self-hosted.
Embedding provider comparison (Neon, M1, 500 memories) #
Same stress test, all four embedding providers, all against Neon. Cloud providers hit their APIs over the network; Ollama ran locally on the M1 with Metal acceleration. Each run wipes everything and starts from scratch – fresh data, cleared embedding cache, 500 memories imported cold.
Mistral’s mistral-embed is locked at 1024 dimensions. Voyage’s voyage-4-lite gives you a few fixed sizes (256, 512, 1024, 2048). OpenAI takes anything up to 1536. Ollama’s embeddinggemma supports 128, 256, 512, and 768 via MRL truncation. We ran 512 and 1024 where we could.
| Metric | Ollama 512 | OpenAI 512 | OpenAI 1024 | Mistral 1024 | Voyage 512 | Voyage 1024 |
|---|---|---|---|---|---|---|
| Cold import | 85.0 ms/mem | 36.3 ms/mem | 39.6 ms/mem | 50.8 ms/mem | 36.6 ms/mem | 40.1 ms/mem |
| Dedup re-import | 2.2 ms/mem | 2.9 ms/mem | 5.2 ms/mem | 5.5 ms/mem | 2.7 ms/mem | 5.9 ms/mem |
| Auto-links (0.85) | 29 | 16 | 13 | 468 | 62 | 63 |
| Hybrid search | 92.4 ms | 97.1 ms | 109.4 ms | 110.0 ms | 95.4 ms | 117.1 ms |
| Explore graph | 85.4 ms | 79.8 ms | 81.3 ms | 101.5 ms | 80.6 ms | 91.1 ms |
| Get related | 71.9 ms | 71.3 ms | 75.8 ms | 71.7 ms | 72.4 ms | 71.3 ms |
Import speed #
OpenAI and Voyage at 512 dims are essentially tied – 36ms per memory. Mistral trails at 50.8ms because it caps you at about 32 texts per request (16K token limit), so you’re making 15x more API calls than OpenAI or Voyage, which take 500+ texts in one go. Ollama at 85ms per memory is the slowest – embeddings run locally, not on a cloud GPU cluster. This was on an M1 Mac with Metal acceleration; expect slower on CPU-only hardware. No API cost though, and nothing leaves your machine.
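The gap mostly comes down to how many round-trips the batch limit forces. A sketch of the chunking arithmetic (embed_batch stands in for whichever provider API call you use):

```python
def embed_all(texts: list[str], embed_batch, batch_size: int) -> list[list[float]]:
    """Chunk texts to the provider's batch limit and embed chunk by chunk.
    1000 texts at Mistral's ~32-per-request cap is ~32 API calls; the same import
    at 500 per call (OpenAI, Voyage) is 2 calls, so far less network overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))  # one API call per chunk
    return vectors
```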
Search latency #
Halving the dimensions from 1024 to 512 cuts 15-20% off search times. Ollama’s 92ms hybrid search edged out the cloud providers, but Voyage at 95ms and OpenAI at 97ms are close enough that it doesn’t matter in practice. At 1024, all three cloud providers sit in the 109-117ms range – you’re measuring Neon round-trips at that point, not vector math.
Auto-linking behaviour #
This was the surprise. The same 0.85 cosine similarity threshold produces wildly different results depending on the provider:
- Mistral found 468 links. Its vectors pack related content close together – the embedding space is tight.
- Voyage found 62-63 links, same at both 512 and 1024 dims.
- Ollama landed at 29 links – moderate spread, sitting between Voyage and OpenAI.
- OpenAI found 13-16 links. Widest spread of the four, which lines up with what we saw switching from Ollama to OpenAI – a threshold that worked for one provider found almost nothing with another.
If you want a similar density of auto-links across providers, tune the threshold: around 0.65-0.70 for OpenAI, 0.75-0.80 for Ollama, 0.70-0.75 for Voyage, and 0.85 works as-is for Mistral. It’s configurable per call.
Provider specs #
| Provider | Model | Batch limit | Dimensions |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 2,048 inputs | Any up to 1536 |
| Mistral | mistral-embed | ~32 inputs (16K tokens) | 1024 only |
| Voyage AI | voyage-4-lite | 1,000 inputs | 256, 512, 1024, 2048 |
| Ollama | embeddinggemma (default) | No API limit | 128, 256, 512, 768 (default 512) |
embeddinggemma supports Matryoshka (MRL) truncation at 128, 256, 512, and 768 dimensions. 512 is the default – best trade-off between speed and quality. Set EMBEDDING_DIM=768 if you want the full native size. Other Ollama models have different dimensions – mxbai-embed-large does 1024, for instance. Set OLLAMA_EMBED_MODEL to switch.
Watch out for Mistral: mistral-embed will flat-out reject the output_dimension parameter. If you want smaller vectors, use OpenAI or Voyage instead.
Dimension reduction (OpenAI 512-dim vs 768-dim, M1) #
OpenAI’s text-embedding-3-small supports Matryoshka truncation – you ask for fewer dimensions and it truncates server-side. We wanted to know if 512 dimensions would hurt search quality, so we tested it against our 768-dim baseline on the same 4,300-memory production database.
Changing dimensions is more than an .env edit. pgvector locks the dimension into the column type, so you need a migration. We ship a generator script that handles the SQL for you.
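Done by hand, the migration amounts to dropping the index, retyping the column, and rebuilding before re-embedding – a hedged sketch with illustrative table, column, and index names (the shipped generator script emits the real SQL for your schema):

```python
import psycopg

with psycopg.connect("postgresql://...") as conn:
    # Old vectors can't be reused at a new dimension, so the column is retyped empty
    # and every memory is re-embedded afterwards (re_embed_all).
    conn.execute("DROP INDEX IF EXISTS memories_embedding_hnsw")
    conn.execute(
        "ALTER TABLE memories "
        "ALTER COLUMN embedding TYPE vector(512) USING NULL::vector(512)"
    )
    conn.execute(
        "CREATE INDEX memories_embedding_hnsw "
        "ON memories USING hnsw ((embedding::halfvec(512)) halfvec_cosine_ops)"
    )
```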
Re-embed performance #
| Dimension | Memories | Time | Batch size |
|---|---|---|---|
| 768 | 4,300 | <1 min | 500/call |
| 512 | 4,300 | <1 min | 500/call |
Both finish in under a minute. re_embed_all batches 500 texts per OpenAI API call. An earlier version sent one text per call – re-embedding at 1024 dimensions took over an hour before we fixed that.
Stress test (1000 memories, M1 + OpenAI, 512-dim) #
| Phase | 768-dim | 512-dim |
|---|---|---|
| Cold import (no dedup) | 24.1 ms/memory | 12.5 ms/memory |
| Re-import (dedup) | 4.6 ms/memory | 4.4 ms/memory |
| Auto-link backfill | 8 links, 1.9s | 9 links, 0.6s |
| Hybrid search | 104 ms mean | 132 ms mean |
| Explore graph | 103 ms mean | 108 ms mean |
| Get related | 83 ms mean | 79 ms mean |
Cold imports run about 2x faster at 512 – less data per vector, faster HNSW indexing. Search and graph latency stayed within the usual network-dominated range; those numbers are still round-trips to Supabase Cloud, not vector math.
Similarity scores land a bit higher at 512 (top results at 0.55-0.77 vs 0.5-0.6 at 768). Vectors packed into fewer dimensions cluster tighter. DEFAULT_MATCH_THRESHOLD=0.35 still works without adjustment.
What we learned #
Two things matter: where your embeddings come from, and where your database lives.
For imports, embeddings are the bottleneck. All three cloud providers beat local Ollama on both machines. OpenAI and Voyage at 512 dimensions come in at ~36ms/memory – nearly identical. Mistral is slower because of its small batch limit, but it finds far more auto-links than the others if that matters to you.
For search and graph queries, the database location matters more than we expected. Moving from Supabase Cloud to self-hosted PostgreSQL cut search latency by 5-8x. Neon lands between the two – graph queries dropped to 70-80ms by cutting out the PostgREST layer, and there’s no server to run. If your AI clients are recalling memories in conversation, the difference is tangible – 16ms feels instant, 110ms adds a beat of lag to every tool call.
Ollama on the N305 works, but at 3-4 seconds per embedding it’s only practical for one-at-a-time use. For bulk imports, pick a cloud provider – any of the three will do. See switching providers.
The N305 also thermal-throttled. Per-memory cost climbed from 2.8s at 20 memories to 4.4s at 50, a 57% increase from heat buildup. Worth knowing if you’re running sustained inference on fanless hardware.
A lighter Ollama model (nomic-embed-text, or all-minilm at 384 dimensions) would run faster on CPU, but you’d need to re-embed everything and resize the pgvector column. Depends on how much keeping things local matters to you.
The M1’s 16 GB unified memory is shared between everything. A machine with 32 GB, or a dedicated GPU with its own VRAM, would give Ollama more room and likely close some of the gap with OpenAI.
Running benchmarks #
Operational benchmarks #
uv run python tests/bench.py # measure embedding, store, search latency
uv run python tests/bench.py --json # machine-readable output
uv run python tests/bench_stress.py --count 1000 # stress test with 1000 memories
LongMemEval retrieval benchmark #
Warning: The full 500-question benchmark ingests 124K+ memories (~1.2GB). Don’t run this against a free-tier database – use a local Docker Postgres (docker run -d --name ogham-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 pgvector/pgvector:pg17) or a paid instance.
# Download the dataset
python benchmarks/longmemeval_benchmark.py --download
# Run all 500 questions
python benchmarks/longmemeval_benchmark.py --top-k 10
# Run a single category for fast iteration (~13 minutes)
python benchmarks/longmemeval_benchmark.py --question-type temporal-reasoning
You’ll need a database (Supabase, Neon, or self-hosted PostgreSQL) and an embedding provider – Ollama for local, or an API key for OpenAI, Mistral, or Voyage.