qmd - Local Markdown Search Engine for Token-Efficient Agentic Retrieval
Research Date: 2026-01-28
Source URL: https://x.com/andrarchy/status/2015783856087929254
Summary
qmd is an open-source, local-first markdown search engine created by Tobias Lütke (CEO of Shopify). The tool addresses a significant inefficiency in agentic AI workflows: when AI assistants search large document collections, traditional approaches of grep-and-read-whole-files consume excessive context window tokens.
Andrew Levine reported a 96% token reduction after integrating qmd with his personal AI assistant (clawdbot). His 600+ note Obsidian vault previously required approximately 15,000 tokens for a simple search query. With qmd, the same operation uses roughly 600 tokens by returning targeted snippets with relevance scores rather than entire documents.
qmd combines three search techniques: BM25 full-text search via SQLite FTS5, vector semantic search using local embedding models, and LLM-based re-ranking for quality sorting. All processing runs locally without cloud dependencies.
Token Economics Problem
The Naive Search Pattern
AI coding assistants and personal knowledge assistants frequently need to search user document collections. The straightforward approach involves:
- Execute grep or ripgrep to find files containing query terms
- Read matched files in their entirety to provide context
- Process the full file contents to extract relevant information
For a vault with 600+ markdown files, this pattern produces substantial token consumption even for simple queries. Andrew Levine measured approximately 15,000 tokens per search operation using this approach with his clawdbot AI assistant.
Token Consumption Breakdown
Consider a query like “what did I write about project X”:
| Operation | Estimated Tokens |
|---|---|
| Grep command and output | 200-500 |
| Reading 5-10 matched files, avg 1500 tokens each | 7,500-15,000 |
| Agent reasoning overhead | 500-1,000 |
| Total | 8,200-16,500 |
With qmd’s snippet-based retrieval:
| Operation | Estimated Tokens |
|---|---|
| Search query | 10-20 |
| 5 result snippets with metadata | 300-500 |
| Agent reasoning overhead | 100-200 |
| Total | 410-720 |
The 96% reduction reported by Levine aligns with these estimates.
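As a quick sanity check, the reported figure falls straight out of the before/after totals (simple arithmetic, not anything from the qmd codebase):

```python
def token_reduction(before: int, after: int) -> float:
    """Fractional token savings between the two approaches."""
    return (before - after) / before

# Levine's reported figures: ~15,000 tokens before qmd, ~600 after
savings = token_reduction(15_000, 600)  # 0.96, i.e. a 96% reduction
```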
Technical Architecture
Hybrid Search Pipeline
qmd implements a sophisticated retrieval pipeline combining multiple search strategies:
Search Backend Components
BM25 Full-Text Search via SQLite FTS5
Traditional keyword matching using the BM25 ranking algorithm, with SQLite's FTS5 extension providing efficient indexing and query execution. FTS5 reports BM25 scores as negative values (more negative means a better match), so qmd converts them to positive scores via absolute value; typical magnitudes range from 0 to roughly 25.
Vector Semantic Search
Documents are chunked into 800-token segments with 15% overlap. Each chunk is embedded using the EmbeddingGemma-300M model. Search queries are embedded and matched against the vector index using cosine distance, converted to similarity via 1 / (1 + distance).
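A minimal sketch of the chunking and similarity conversion described above, assuming the document is already split into a token list; qmd's real tokenizer, chunk boundaries, and vector storage will differ:

```python
def chunk(tokens: list, size: int = 800, overlap_frac: float = 0.15) -> list:
    """Split a token sequence into overlapping chunks.

    With size=800 and 15% overlap the stride is 680 tokens, so each
    chunk shares its first 120 tokens with the previous one.
    """
    step = round(size * (1 - overlap_frac))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks

def similarity(cosine_distance: float) -> float:
    """Map cosine distance to a (0, 1] score via 1 / (1 + distance)."""
    return 1 / (1 + cosine_distance)
```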
LLM Re-ranking
Top candidates from initial retrieval pass through the Qwen3-reranker model. The re-ranker evaluates each document against the query and assigns a relevance score (0-10 scale, normalized to 0-1).
Reciprocal Rank Fusion (RRF)
qmd merges results from multiple search paths using RRF with modifications:
score = Σ(1/(k+rank+1)) where k=60
Additional bonuses:
- Original query results receive 2x weight
- Documents ranking first in any list receive +0.05 bonus
- Documents ranking second or third receive +0.02 bonus
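The fusion rule and bonuses above can be sketched as follows. This is a reconstruction from the stated formula, not qmd's actual code; in particular, applying the 2x weight as a per-list multiplier and the bonuses per list are assumptions:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list, weights: list, k: int = 60) -> dict:
    """Merge ranked result lists with Reciprocal Rank Fusion plus bonuses.

    ranked_lists: one list of document IDs per search path, best first.
    weights: per-list multiplier (e.g. 2.0 for the original query's results).
    """
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs):       # rank is 0-based, as in the formula
            scores[doc] += weight / (k + rank + 1)
            if rank == 0:
                scores[doc] += 0.05             # first place in any list
            elif rank in (1, 2):
                scores[doc] += 0.02             # second or third place
    return dict(scores)
```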
Position-Aware Blending
Final scores blend retrieval and re-ranker scores based on RRF rank:
| RRF Rank | Retrieval Weight | Reranker Weight |
|---|---|---|
| 1-3 | 75% | 25% |
| 4-10 | 60% | 40% |
| 11+ | 40% | 60% |
This approach preserves exact matches (which retrieval excels at) while allowing the re-ranker to improve ordering for lower-ranked results.
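The blending table translates directly into a small scoring function (a sketch of the rule as stated, not qmd's actual implementation):

```python
def blend(rrf_rank: int, retrieval_score: float, reranker_score: float) -> float:
    """Blend retrieval and re-ranker scores by 1-based RRF rank."""
    if rrf_rank <= 3:
        retrieval_weight = 0.75   # top results: trust retrieval's exact matches
    elif rrf_rank <= 10:
        retrieval_weight = 0.60
    else:
        retrieval_weight = 0.40   # tail results: lean on the re-ranker
    return retrieval_weight * retrieval_score + (1 - retrieval_weight) * reranker_score
```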
Local Model Stack
qmd uses three GGUF models downloaded automatically on first use:
| Model | Purpose | Size |
|---|---|---|
| embeddinggemma-300M-Q8_0 | Vector embeddings | ~300MB |
| qwen3-reranker-0.6b-q8_0 | Re-ranking | ~640MB |
| Qwen3-1.7B-Q8_0 | Query expansion | ~2.2GB |
Models run locally via node-llama-cpp with GGUF format. Total model storage is approximately 3.1GB, cached in ~/.cache/qmd/models/.
Embedding Prompt Format
Queries and documents use distinct prompt formats for the embedding model:
// Query embedding
"task: search result | query: {query}"
// Document embedding
"title: {title} | text: {content}"
CLI Usage and Agent Integration
Collection Setup
# Install globally via Bun
bun install -g https://github.com/tobi/qmd
# Index directories as named collections
qmd collection add ~/notes --name notes
qmd collection add ~/Documents/meetings --name meetings
# Add context descriptions to aid search
qmd context add qmd://notes "Personal notes and ideas"
qmd context add qmd://meetings "Meeting transcripts and notes"
# Generate vector embeddings
qmd embed
Search Commands
qmd offers three search modes with different tradeoffs:
| Command | Method | Speed | Quality |
|---|---|---|---|
| search | BM25 only | Fast | Good |
| vsearch | Vector only | Fast | Good |
| query | Hybrid with expansion and re-ranking | Slow | Best |
# Keyword search
qmd search "project timeline"
# Semantic search
qmd vsearch "how to deploy"
# Full hybrid pipeline
qmd query "quarterly planning process"
Agent-Optimized Output Formats
qmd provides structured output formats designed for LLM consumption:
# JSON for programmatic parsing
qmd search "authentication" --json -n 10
# File list with scores
qmd query "error handling" --all --files --min-score 0.4
# Retrieve specific document
qmd get "docs/api-reference.md" --full
MCP Server Integration
qmd exposes an MCP (Model Context Protocol) server for direct integration with Claude Desktop and Claude Code:
Claude Desktop Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}
Claude Code Configuration (~/.claude/settings.json):
{
"mcpServers": {
"qmd": {
"command": "qmd",
"args": ["mcp"]
}
}
}
MCP tools exposed:
- qmd_search - BM25 keyword search
- qmd_vsearch - Vector semantic search
- qmd_query - Hybrid search with re-ranking
- qmd_get - Retrieve document by path or ID
- qmd_multi_get - Retrieve multiple documents by pattern
- qmd_status - Index health and collection information
Data Storage
qmd stores its index in SQLite at ~/.cache/qmd/index.sqlite. The schema includes:
| Table | Purpose |
|---|---|
| collections | Indexed directories with name and glob patterns |
| path_contexts | Context descriptions by virtual path |
| documents | Markdown content with metadata and 6-char docid |
| documents_fts | FTS5 full-text index |
| content_vectors | Embedding chunks with hash, sequence, position |
| vectors_vec | sqlite-vec vector index |
| llm_cache | Cached LLM responses for query expansion, re-ranking |
The document ID system uses 6-character hashes, enabling retrieval by ID (e.g., qmd get "#abc123") without knowing the file path.
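qmd's actual hash input and algorithm are not documented here, but a short stable ID of this shape could be derived along these lines (illustrative only; the function name and the choice of SHA-256 over the path are assumptions):

```python
import hashlib

def docid(path: str, length: int = 6) -> str:
    """Derive a short, stable ID from a document path.

    Illustrative sketch: qmd's real scheme may hash different input
    or use a different algorithm entirely.
    """
    return hashlib.sha256(path.encode("utf-8")).hexdigest()[:length]
```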
Comparison with Alternative Approaches
Grep and Read Pattern
The baseline approach most AI assistants use. Grep finds files, then the agent reads entire files. Simple to implement but token-inefficient for large collections.
RAG Systems
Traditional Retrieval-Augmented Generation systems chunk documents and retrieve via vector search. qmd improves on basic RAG through:
- Hybrid retrieval (BM25 + vector) for better recall
- Query expansion for improved coverage
- Re-ranking for precision improvement
- Local execution without API calls or cloud dependencies
Commercial Solutions
Services like Obsidian Copilot or Notion AI provide search capabilities but require cloud processing and have usage limits. qmd maintains data locality and has no per-query costs after initial model download.
Practical Deployment Considerations
Hardware Requirements
- Bun runtime >= 1.0.0
- macOS with Homebrew SQLite (for extension support)
- Approximately 4GB RAM for model loading
- 3.1GB disk space for models
- SSD recommended for index performance
Index Maintenance
# Check index status
qmd status
# Re-index after file changes
qmd update
# Re-index with git pull for remote repos
qmd update --pull
# Clean orphaned data
qmd cleanup
Performance Characteristics
Initial embedding generation scales linearly with document count. The 800-token chunk size with 15% overlap balances retrieval granularity against embedding cost. Query latency depends on mode:
- search (BM25): ~10-50ms
- vsearch (vector): ~50-200ms
- query (hybrid + re-ranking): ~500-2000ms
The query command’s latency comes primarily from LLM operations (query expansion and re-ranking).
Use Case: Personal Knowledge Assistant
Andrew Levine’s configuration demonstrates the primary use case. His setup combines:
- Obsidian vault with 600+ markdown notes
- Clawdbot (Moltbot) as the AI assistant framework
- qmd as the search backend via CLI or MCP
When Levine asks his assistant “what did I write about X?”, the workflow becomes:
- Assistant invokes qmd query "X" --json -n 5
- qmd returns 5 snippets with relevance scores and metadata
- Assistant synthesizes a response from the snippets
- Total token cost: approximately 600 tokens vs 15,000 previously
The 96% reduction enables more interactive querying without exhausting context windows or API budgets.
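An agent-side integration along these lines can be sketched with a subprocess call. The flags match the qmd query invocation shown above; the function names are hypothetical, and the assumption that --json emits a single JSON value on stdout is unverified:

```python
import json
import subprocess

def qmd_argv(query: str, n: int = 5) -> list:
    """Build the CLI invocation used in the workflow above."""
    return ["qmd", "query", query, "--json", "-n", str(n)]

def search_notes(query: str, n: int = 5):
    """Run qmd's hybrid search and parse its JSON output.

    Assumes qmd is on PATH and that --json prints a JSON value
    (the shape of the result objects is qmd's own).
    """
    result = subprocess.run(qmd_argv(query, n),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```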
Key Findings
- qmd achieves 96% token reduction for document search operations in agentic workflows
- Hybrid search combining BM25, vector embeddings, and LLM re-ranking outperforms single-method approaches
- Local execution using quantized GGUF models eliminates API costs and maintains data privacy
- MCP server integration enables direct Claude Desktop and Claude Code usage
- The tool is particularly effective for personal knowledge bases like Obsidian vaults
- Position-aware blending preserves exact match quality while improving overall ranking
References
- qmd GitHub Repository - Accessed 2026-01-28
- Andrew Levine’s X Post - 2026-01-28
- Clawdbot Documentation - Accessed 2026-01-28
- Model Context Protocol - Accessed 2026-01-28
- LinkedIn discussion on qmd - Accessed 2026-01-28