qmd - Local Markdown Search Engine for Token-Efficient Agentic Retrieval
Research Date: 2026-01-28
Source URL: https://x.com/andrarchy/status/2015783856087929254
Summary
qmd is an open-source, local-first markdown search engine created by Tobias Lütke (CEO of Shopify). The tool addresses a significant inefficiency in agentic AI workflows: when AI assistants search large document collections, traditional approaches of grep-and-read-whole-files consume excessive context window tokens.
Andrew Levine reported a 96% token reduction after integrating qmd with his personal AI assistant (clawdbot). His 600+ note Obsidian vault previously required approximately 15,000 tokens for a simple search query. With qmd, the same operation uses roughly 600 tokens by returning targeted snippets with relevance scores rather than entire documents.
qmd combines three search techniques: BM25 full-text search via SQLite FTS5, vector semantic search using local embedding models, and LLM-based re-ranking for quality sorting. All processing runs locally without cloud dependencies.
Token Economics Problem
The Naive Search Pattern
AI coding assistants and personal knowledge assistants frequently need to search user document collections. The straightforward approach involves:
- Execute grep or ripgrep to find files containing query terms
- Read matched files in their entirety to provide context
- Process the full file contents to extract relevant information
For a vault with 600+ markdown files, this pattern produces substantial token consumption even for simple queries. Andrew Levine measured approximately 15,000 tokens per search operation using this approach with his clawdbot AI assistant.
Token Consumption Breakdown
Consider a query like “what did I write about project X”:
| Operation | Estimated Tokens |
|---|---|
| Grep command and output | 200-500 |
| Reading 5-10 matched files, avg 1500 tokens each | 7,500-15,000 |
| Agent reasoning overhead | 500-1,000 |
| Total | 8,200-16,500 |
With qmd’s snippet-based retrieval:
| Operation | Estimated Tokens |
|---|---|
| Search query | 10-20 |
| 5 result snippets with metadata | 300-500 |
| Agent reasoning overhead | 100-200 |
| Total | 410-720 |
The 96% reduction reported by Levine aligns with these estimates.
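As a quick sanity check, the reported figure falls straight out of the before/after totals (simple arithmetic, not anything from the qmd codebase):

```python
def token_reduction(before: int, after: int) -> float:
    """Fractional token savings between the two approaches."""
    return (before - after) / before

# Levine's reported figures: ~15,000 tokens before qmd, ~600 after
savings = token_reduction(15_000, 600)  # 0.96, i.e. a 96% reduction
```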
Technical Architecture
Hybrid Search Pipeline
qmd implements a sophisticated retrieval pipeline combining multiple search strategies:
Search Backend Components
BM25 Full-Text Search via SQLite FTS5
Traditional keyword matching using the BM25 ranking algorithm, with SQLite's FTS5 extension providing efficient indexing and query execution. FTS5 reports BM25 scores as negative values (more negative means a better match), so qmd converts them to positive scores via absolute value; typical magnitudes range from 0 to roughly 25.
Vector Semantic Search
Documents are chunked into 800-token segments with 15% overlap. Each chunk is embedded using the EmbeddingGemma-300M model. Search queries are embedded and matched against the vector index using cosine distance, converted to similarity via 1 / (1 + distance).
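A minimal sketch of the chunking and similarity conversion described above, assuming the document is already split into a token list; qmd's real tokenizer, chunk boundaries, and vector storage will differ:

```python
def chunk(tokens: list, size: int = 800, overlap_frac: float = 0.15) -> list:
    """Split a token sequence into overlapping chunks.

    With size=800 and 15% overlap the stride is 680 tokens, so each
    chunk shares its first 120 tokens with the previous one.
    """
    step = round(size * (1 - overlap_frac))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks

def similarity(cosine_distance: float) -> float:
    """Map cosine distance to a (0, 1] score via 1 / (1 + distance)."""
    return 1 / (1 + cosine_distance)
```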
LLM Re-ranking
Top candidates from initial retrieval pass through the Qwen3-reranker model. The re-ranker evaluates each document against the query and assigns a relevance score (0-10 scale, normalized to 0-1).
Reciprocal Rank Fusion (RRF)
qmd merges results from multiple search paths using RRF with modifications:
score = Σ(1/(k+rank+1)) where k=60
Additional bonuses:
- Original query results receive 2x weight
- Documents ranking first in any list receive +0.05 bonus
- Documents ranking second or third receive +0.02 bonus
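The fusion rule and bonuses above can be sketched as follows. This is a reconstruction from the stated formula, not qmd's actual code; in particular, applying the 2x weight as a per-list multiplier and the bonuses per list are assumptions:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list, weights: list, k: int = 60) -> dict:
    """Merge ranked result lists with Reciprocal Rank Fusion plus bonuses.

    ranked_lists: one list of document IDs per search path, best first.
    weights: per-list multiplier (e.g. 2.0 for the original query's results).
    """
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs):       # rank is 0-based, as in the formula
            scores[doc] += weight / (k + rank + 1)
            if rank == 0:
                scores[doc] += 0.05             # first place in any list
            elif rank in (1, 2):
                scores[doc] += 0.02             # second or third place
    return dict(scores)
```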
Position-Aware Blending
Final scores blend retrieval and re-ranker scores based on RRF rank:
| RRF Rank | Retrieval Weight | Reranker Weight |
|---|---|---|
| 1-3 | 75% | 25% |
| 4-10 | 60% | 40% |
| 11+ | 40% | 60% |
This approach preserves exact matches (which retrieval excels at) while allowing the re-ranker to improve ordering for lower-ranked results.
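The blending table translates directly into a small scoring function (a sketch of the rule as stated, not qmd's actual implementation):

```python
def blend(rrf_rank: int, retrieval_score: float, reranker_score: float) -> float:
    """Blend retrieval and re-ranker scores by 1-based RRF rank."""
    if rrf_rank <= 3:
        retrieval_weight = 0.75   # top results: trust retrieval's exact matches
    elif rrf_rank <= 10:
        retrieval_weight = 0.60
    else:
        retrieval_weight = 0.40   # tail results: lean on the re-ranker
    return retrieval_weight * retrieval_score + (1 - retrieval_weight) * reranker_score
```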
Local Model Stack
qmd uses three GGUF models downloaded automatically on first use:
| Model | Purpose | Size |
|---|---|---|
| embeddinggemma-300M-Q8_0 | Vector embeddings | ~300MB |
| qwen3-reranker-0.6b-q8_0 | Re-ranking | ~640MB |
| Qwen3-1.7B-Q8_0 | Query expansion | ~2.2GB |
Models run locally via node-llama-cpp with GGUF format. Total model storage is approximately 3.1GB, cached in ~/.cache/qmd/models/.
Embedding Prompt Format
Queries and documents use distinct prompt formats for the embedding model:
// Query embedding
"task: search result | query: {query}"
// Document embedding
"title: {title} | text: {content}"
CLI Usage and Agent Integration
Collection Setup
# Install globally via Bun
bun install -g https://github.com/tobi/qmd
# Index directories as named collections
qmd collection add ~/notes --name notes
qmd collection add ~/Documents/meetings --name meetings
# Add context descriptions to aid search
qmd context add qmd://notes "Personal notes and ideas"
qmd context add qmd://meetings "Meeting transcripts and notes"
# Generate vector embeddings
qmd embed
Search Commands
qmd offers three search modes with different tradeoffs:
| Command | Method | Speed | Quality |
|---|---|---|---|
| search | BM25 only | Fast | Good |
| vsearch | Vector only | Fast | Good |
| query | Hybrid with expansion and re-ranking | Slow | Best |
# Keyword search
qmd search "project timeline"
# Semantic search
qmd vsearch "how to deploy"
# Full hybrid pipeline
qmd query "quarterly planning process"
Agent-Optimized Output Formats
qmd provides structured output formats designed for LLM consumption:
# JSON for programmatic parsing
qmd search "authentication" --json -n 10
# File list with scores
qmd query "error handling" --all --files --min-score 0.4
# Retrieve specific document
qmd get "docs/api-reference.md" --full
MCP Server Integration
qmd exposes an MCP (Model Context Protocol) server for direct integration with Claude Desktop and Claude Code:
Claude Desktop Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}
Claude Code Configuration (~/.claude/settings.json):
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}
MCP tools exposed:
- qmd_search - BM25 keyword search
- qmd_vsearch - Vector semantic search
- qmd_query - Hybrid search with re-ranking
- qmd_get - Retrieve document by path or ID
- qmd_multi_get - Retrieve multiple documents by pattern
- qmd_status - Index health and collection information
Data Storage
qmd stores its index in SQLite at ~/.cache/qmd/index.sqlite. The schema includes:
| Table | Purpose |
|---|---|
| collections | Indexed directories with name and glob patterns |
| path_contexts | Context descriptions by virtual path |
| documents | Markdown content with metadata and 6-char docid |
| documents_fts | FTS5 full-text index |
| content_vectors | Embedding chunks with hash, sequence, position |
| vectors_vec | sqlite-vec vector index |
| llm_cache | Cached LLM responses for query expansion, re-ranking |
The document ID system uses 6-character hashes, enabling retrieval by ID (e.g., qmd get "#abc123") without knowing the file path.
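qmd's actual hash input and algorithm are not documented here, but a short stable ID of this shape could be derived along these lines (illustrative only; the function name and the choice of SHA-256 over the path are assumptions):

```python
import hashlib

def docid(path: str, length: int = 6) -> str:
    """Derive a short, stable ID from a document path.

    Illustrative sketch: qmd's real scheme may hash different input
    or use a different algorithm entirely.
    """
    return hashlib.sha256(path.encode("utf-8")).hexdigest()[:length]
```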
Comparison with Alternative Approaches
Grep and Read Pattern
The baseline approach most AI assistants use. Grep finds files, then the agent reads entire files. Simple to implement but token-inefficient for large collections.
RAG Systems
Traditional Retrieval-Augmented Generation systems chunk documents and retrieve via vector search. qmd improves on basic RAG through:
- Hybrid retrieval (BM25 + vector) for better recall
- Query expansion for improved coverage
- Re-ranking for precision improvement
- Local execution without API calls or cloud dependencies
Commercial Solutions
Services like Obsidian Copilot or Notion AI provide search capabilities but require cloud processing and have usage limits. qmd maintains data locality and has no per-query costs after initial model download.
Practical Deployment Considerations
Hardware Requirements
- Bun runtime >= 1.0.0
- macOS with Homebrew SQLite (for extension support)
- Approximately 4GB RAM for model loading
- 3.1GB disk space for models
- SSD recommended for index performance
Index Maintenance
# Check index status
qmd status
# Re-index after file changes
qmd update
# Re-index with git pull for remote repos
qmd update --pull
# Clean orphaned data
qmd cleanup
Performance Characteristics
Initial embedding generation scales linearly with document count. The 800-token chunk size with 15% overlap balances retrieval granularity against embedding cost. Query latency depends on mode:
- search (BM25): ~10-50ms
- vsearch (vector): ~50-200ms
- query (hybrid + re-ranking): ~500-2000ms
The query command’s latency comes primarily from LLM operations (query expansion and re-ranking).
Use Case: Personal Knowledge Assistant
Andrew Levine’s configuration demonstrates the primary use case. His setup combines:
- Obsidian vault with 600+ markdown notes
- Clawdbot (Moltbot) as the AI assistant framework
- qmd as the search backend via CLI or MCP
When Levine asks his assistant “what did I write about X?”, the workflow becomes:
- Assistant invokes qmd query "X" --json -n 5
- qmd returns 5 snippets with relevance scores and metadata
- Assistant synthesizes a response from the snippets
- Total token cost: approximately 600 tokens vs 15,000 previously
The 96% reduction enables more interactive querying without exhausting context windows or API budgets.
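An agent-side integration along these lines can be sketched with a subprocess call. The flags match the qmd query invocation shown above; the function names are hypothetical, and the assumption that --json emits a single JSON value on stdout is unverified:

```python
import json
import subprocess

def qmd_argv(query: str, n: int = 5) -> list:
    """Build the CLI invocation used in the workflow above."""
    return ["qmd", "query", query, "--json", "-n", str(n)]

def search_notes(query: str, n: int = 5):
    """Run qmd's hybrid search and parse its JSON output.

    Assumes qmd is on PATH and that --json prints a JSON value
    (the shape of the result objects is qmd's own).
    """
    result = subprocess.run(qmd_argv(query, n),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```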
Key Findings
- qmd achieves 96% token reduction for document search operations in agentic workflows
- Hybrid search combining BM25, vector embeddings, and LLM re-ranking outperforms single-method approaches
- Local execution using quantized GGUF models eliminates API costs and maintains data privacy
- MCP server integration enables direct Claude Desktop and Claude Code usage
- The tool is particularly effective for personal knowledge bases like Obsidian vaults
- Position-aware blending preserves exact match quality while improving overall ranking
References
- qmd GitHub Repository - Accessed 2026-01-28
- Andrew Levine’s X Post - 2026-01-28
- Clawdbot Documentation - Accessed 2026-01-28
- Model Context Protocol - Accessed 2026-01-28
- LinkedIn discussion on qmd - Accessed 2026-01-28