qmd - Local Markdown Search Engine for Token-Efficient Agentic Retrieval

Research Date: 2026-01-28
Source URL: https://x.com/andrarchy/status/2015783856087929254

Summary

qmd is an open-source, local-first markdown search engine created by Tobias Lütke (CEO of Shopify). The tool addresses a significant inefficiency in agentic AI workflows: when AI assistants search large document collections, the traditional grep-and-read-whole-files approach consumes excessive context-window tokens.

Andrew Levine reported a 96% token reduction after integrating qmd with his personal AI assistant (clawdbot). His 600+ note Obsidian vault previously required approximately 15,000 tokens for a simple search query. With qmd, the same operation uses roughly 600 tokens by returning targeted snippets with relevance scores rather than entire documents.

qmd combines three search techniques: BM25 full-text search via SQLite FTS5, vector semantic search using local embedding models, and LLM-based re-ranking to refine result ordering. All processing runs locally, with no cloud dependencies.

Token Economics Problem

The Naive Search Pattern

AI coding assistants and personal knowledge assistants frequently need to search user document collections. The straightforward approach involves:

  1. Execute grep or ripgrep to find files containing query terms
  2. Read matched files in their entirety to provide context
  3. Process the full file contents to extract relevant information

For a vault with 600+ markdown files, this pattern produces substantial token consumption even for simple queries. Andrew Levine measured approximately 15,000 tokens per search operation using this approach with his clawdbot AI assistant.
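The three steps above can be demonstrated on a throwaway vault (paths and contents are illustrative, not Levine's setup):

```shell
# Build a tiny two-file "vault" to search
vault=$(mktemp -d)
printf 'notes on project X\n' > "$vault/a.md"
printf 'unrelated content\n' > "$vault/b.md"

# Naive grep-and-read: find matching files, then pull each one
# into context in its entirety
grep -rl "project X" "$vault" --include="*.md" | xargs cat
```

Every byte of every matched file enters the context window, which is what drives the token counts below.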

Token Consumption Breakdown

Consider a query like “what did I write about project X”:

| Operation | Estimated Tokens |
| --- | --- |
| Grep command and output | 200-500 |
| Reading 5-10 matched files, avg 1,500 tokens each | 7,500-15,000 |
| Agent reasoning overhead | 500-1,000 |
| Total | 8,200-16,500 |

With qmd’s snippet-based retrieval:

| Operation | Estimated Tokens |
| --- | --- |
| Search query | 10-20 |
| 5 result snippets with metadata | 300-500 |
| Agent reasoning overhead | 100-200 |
| Total | 410-720 |

The 96% reduction reported by Levine aligns with these estimates.
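The headline figure follows directly from the two measured token counts:

```python
# Arithmetic behind the reported reduction (measurements from Levine's setup)
naive_tokens = 15_000  # grep-and-read-whole-files, per search
qmd_tokens = 600       # snippet-based retrieval with qmd, per search

reduction = (naive_tokens - qmd_tokens) / naive_tokens
print(f"{reduction:.0%}")  # → 96%
```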

Technical Architecture

Hybrid Search Pipeline

qmd implements a sophisticated retrieval pipeline combining multiple search strategies:

Search Backend Components

BM25 Full-Text Search via SQLite FTS5

Traditional keyword matching using the BM25 ranking algorithm, with indexing and query execution handled by SQLite's FTS5 extension. FTS5 reports BM25 scores as negative values, so qmd takes the absolute value; raw scores range from 0 to approximately 25+.

Vector Semantic Search via Local Embeddings

Documents are chunked into 800-token segments with 15% overlap. Each chunk is embedded using the EmbeddingGemma-300M model. Search queries are embedded and matched against the vector index using cosine distance, converted to similarity via 1 / (1 + distance).
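The chunking scheme and score conversion can be sketched in a few lines of Python. This is illustrative, not qmd's actual code: function names are invented, and a plain token list stands in for the real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=800, overlap_ratio=0.15):
    """Split a token list into overlapping chunks (800 tokens, 15% overlap)."""
    step = int(chunk_size * (1 - overlap_ratio))  # stride of 680 tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

def distance_to_similarity(distance):
    """qmd's stated mapping from cosine distance to a similarity score."""
    return 1.0 / (1.0 + distance)

tokens = [f"tok{i}" for i in range(2000)]
print(len(chunk_tokens(tokens)))         # a 2000-token doc yields 3 chunks
print(distance_to_similarity(0.0))       # identical vectors → 1.0
```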

LLM Re-ranking

Top candidates from initial retrieval pass through the Qwen3-reranker model. The re-ranker evaluates each document against the query and assigns a relevance score (0-10 scale, normalized to 0-1).

Reciprocal Rank Fusion (RRF)

qmd merges results from multiple search paths using RRF with modifications:

score = Σ(1/(k+rank+1)) where k=60

Additional bonuses:

  • Original query results receive 2x weight
  • Documents ranking first in any list receive +0.05 bonus
  • Documents ranking second or third receive +0.02 bonus
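The fusion rule and bonuses above can be sketched as follows; the function name and data shapes are illustrative, not qmd's internals, and ranks are treated as 0-based to match the `rank+1` term in the formula:

```python
def rrf_fuse(result_lists, k=60, original_query_index=0):
    """Reciprocal Rank Fusion with the modifications described above.

    result_lists: one ranked list of doc ids per search path; the list at
    original_query_index came from the original (unexpanded) query.
    """
    scores = {}
    for i, results in enumerate(result_lists):
        weight = 2.0 if i == original_query_index else 1.0  # 2x original query
        for rank, doc in enumerate(results):                # rank is 0-based
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank + 1)
            if rank == 0:
                scores[doc] += 0.05   # bonus for topping any list
            elif rank in (1, 2):
                scores[doc] += 0.02   # bonus for second or third place
    return sorted(scores.items(), key=lambda kv: -kv[1])

fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"]])
print(fused)  # fused (doc, score) pairs, best first
```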

Position-Aware Blending

Final scores blend retrieval and re-ranker scores based on RRF rank:

| RRF Rank | Retrieval Weight | Reranker Weight |
| --- | --- | --- |
| 1-3 | 75% | 25% |
| 4-10 | 60% | 40% |
| 11+ | 40% | 60% |

This approach preserves exact matches (which retrieval excels at) while allowing the re-ranker to improve ordering for lower-ranked results.
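The blending table reduces to a small weighted average keyed on RRF rank (a sketch; the function name is illustrative):

```python
def blend(rrf_rank, retrieval_score, reranker_score):
    """Blend retrieval and re-ranker scores by 1-based RRF rank."""
    if rrf_rank <= 3:
        w_retrieval = 0.75   # top ranks: trust retrieval's exact matches
    elif rrf_rank <= 10:
        w_retrieval = 0.60
    else:
        w_retrieval = 0.40   # deep ranks: let the re-ranker reorder
    return w_retrieval * retrieval_score + (1 - w_retrieval) * reranker_score

print(blend(1, 0.9, 0.4))   # rank 1: retrieval-dominated score
print(blend(12, 0.9, 0.4))  # rank 12: re-ranker-dominated score
```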

Local Model Stack

qmd uses three GGUF models downloaded automatically on first use:

| Model | Purpose | Size |
| --- | --- | --- |
| embeddinggemma-300M-Q8_0 | Vector embeddings | ~300MB |
| qwen3-reranker-0.6b-q8_0 | Re-ranking | ~640MB |
| Qwen3-1.7B-Q8_0 | Query expansion | ~2.2GB |

Models run locally via node-llama-cpp with GGUF format. Total model storage is approximately 3.1GB, cached in ~/.cache/qmd/models/.

Embedding Prompt Format

Queries and documents use distinct prompt formats for the embedding model:

// Query embedding
"task: search result | query: {query}"

// Document embedding  
"title: {title} | text: {content}"

CLI Usage and Agent Integration

Collection Setup

# Install globally via Bun
bun install -g https://github.com/tobi/qmd

# Index directories as named collections
qmd collection add ~/notes --name notes
qmd collection add ~/Documents/meetings --name meetings

# Add context descriptions to aid search
qmd context add qmd://notes "Personal notes and ideas"
qmd context add qmd://meetings "Meeting transcripts and notes"

# Generate vector embeddings
qmd embed

Search Commands

qmd offers three search modes with different tradeoffs:

| Command | Method | Speed | Quality |
| --- | --- | --- | --- |
| search | BM25 only | Fast | Good |
| vsearch | Vector only | Fast | Good |
| query | Hybrid with expansion and re-ranking | Slow | Best |

# Keyword search
qmd search "project timeline"

# Semantic search
qmd vsearch "how to deploy"

# Full hybrid pipeline
qmd query "quarterly planning process"

Agent-Optimized Output Formats

qmd provides structured output formats designed for LLM consumption:

# JSON for programmatic parsing
qmd search "authentication" --json -n 10

# File list with scores
qmd query "error handling" --all --files --min-score 0.4

# Retrieve specific document
qmd get "docs/api-reference.md" --full

MCP Server Integration

qmd exposes an MCP (Model Context Protocol) server for direct integration with Claude Desktop and Claude Code:

Claude Desktop Configuration (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

Claude Code Configuration (~/.claude/settings.json):

{
  "mcpServers": {
    "qmd": {
      "command": "qmd",
      "args": ["mcp"]
    }
  }
}

MCP tools exposed:

  • qmd_search - BM25 keyword search
  • qmd_vsearch - Vector semantic search
  • qmd_query - Hybrid search with re-ranking
  • qmd_get - Retrieve document by path or ID
  • qmd_multi_get - Retrieve multiple documents by pattern
  • qmd_status - Index health and collection information

Data Storage

qmd stores its index in SQLite at ~/.cache/qmd/index.sqlite. The schema includes:

| Table | Purpose |
| --- | --- |
| collections | Indexed directories with name and glob patterns |
| path_contexts | Context descriptions by virtual path |
| documents | Markdown content with metadata and 6-char docid |
| documents_fts | FTS5 full-text index |
| content_vectors | Embedding chunks with hash, sequence, position |
| vectors_vec | sqlite-vec vector index |
| llm_cache | Cached LLM responses for query expansion and re-ranking |

The document ID system uses 6-character hashes, enabling retrieval by ID (e.g., qmd get "#abc123") without knowing the file path.
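The source does not specify how the 6-character hash is derived; the sketch below is one plausible scheme (first 6 hex characters of a SHA-256 over the document path) used only to illustrate the shape of the ID space:

```python
import hashlib

def docid(path: str) -> str:
    """Hypothetical 6-char document ID: SHA-256 of the path, truncated.

    qmd's actual derivation may differ; 6 hex chars give 16^6 (~16.7M)
    possible IDs, plenty for collections of a few thousand documents.
    """
    return hashlib.sha256(path.encode()).hexdigest()[:6]

print(docid("docs/api-reference.md"))  # a stable 6-char hex id
```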

Comparison with Alternative Approaches

Grep and Read Pattern

The baseline approach most AI assistants use. Grep finds files, then the agent reads entire files. Simple to implement but token-inefficient for large collections.

RAG Systems

Traditional Retrieval-Augmented Generation systems chunk documents and retrieve via vector search. qmd improves on basic RAG through:

  • Hybrid retrieval (BM25 + vector) for better recall
  • Query expansion for improved coverage
  • Re-ranking for precision improvement
  • Local execution without API calls or cloud dependencies

Commercial Solutions

Services like Obsidian Copilot or Notion AI provide search capabilities but require cloud processing and have usage limits. qmd maintains data locality and has no per-query costs after initial model download.

Practical Deployment Considerations

Hardware Requirements

  • Bun runtime >= 1.0.0
  • macOS with Homebrew SQLite (for extension support)
  • Approximately 4GB RAM for model loading
  • 3.1GB disk space for models
  • SSD recommended for index performance

Index Maintenance

# Check index status
qmd status

# Re-index after file changes
qmd update

# Re-index with git pull for remote repos
qmd update --pull

# Clean orphaned data
qmd cleanup

Performance Characteristics

Initial embedding generation scales linearly with document count. The 800-token chunk size with 15% overlap balances retrieval granularity against embedding cost. Query latency depends on mode:

  • search (BM25): ~10-50ms
  • vsearch (vector): ~50-200ms
  • query (hybrid + re-ranking): ~500-2000ms

The query command’s latency comes primarily from LLM operations (query expansion and re-ranking).

Use Case: Personal Knowledge Assistant

Andrew Levine’s configuration demonstrates the primary use case. His setup combines:

  1. Obsidian vault with 600+ markdown notes
  2. Clawdbot (Moltbot) as the AI assistant framework
  3. qmd as the search backend via CLI or MCP

When Levine asks his assistant “what did I write about X?”, the workflow becomes:

  1. Assistant invokes qmd query "X" --json -n 5
  2. qmd returns 5 snippets with relevance scores and metadata
  3. Assistant synthesizes response from snippets
  4. Total token cost: approximately 600 tokens vs 15,000 previously

The 96% reduction enables more interactive querying without exhausting context windows or API budgets.

Key Findings

  • qmd achieves 96% token reduction for document search operations in agentic workflows
  • Hybrid search combining BM25, vector embeddings, and LLM re-ranking outperforms single-method approaches
  • Local execution using quantized GGUF models eliminates API costs and maintains data privacy
  • MCP server integration enables direct Claude Desktop and Claude Code usage
  • The tool is particularly effective for personal knowledge bases like Obsidian vaults
  • Position-aware blending preserves exact match quality while improving overall ranking

References

  1. qmd GitHub Repository - Accessed 2026-01-28
  2. Andrew Levine’s X Post - 2026-01-28
  3. Clawdbot Documentation - Accessed 2026-01-28
  4. Model Context Protocol - Accessed 2026-01-28
  5. LinkedIn discussion on qmd - Accessed 2026-01-28