PageIndex: Vectorless Reasoning-Based RAG Framework

Research Date: 2026-01-28 Source URL: https://x.com/sumanth_077/status/2013232922296561826

Summary

PageIndex is an open-source RAG framework developed by VectifyAI that eliminates vector databases and document chunking from retrieval systems. The framework constructs hierarchical tree structures from documents and uses LLM-powered reasoning to navigate these structures, mimicking how human experts read complex documents. This approach achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering, compared to approximately 50% accuracy from traditional vector-based RAG systems.

The system represents a fundamental architectural shift: rather than computing semantic similarity between query embeddings and document chunk embeddings, PageIndex treats retrieval as a reasoning problem where the LLM actively decides which document sections to examine based on logical inference about document structure.

Technical Architecture

Two-Phase Operation

PageIndex operates through two distinct phases:

Phase 1 - Indexing: Documents undergo parsing and transformation into a hierarchical JSON-based tree structure. This structure resembles a table of contents but includes summaries, metadata, and page references at each node. The tree preserves natural document organization rather than imposing arbitrary chunking boundaries.

Phase 2 - Retrieval: During query time, the LLM performs reasoning-based tree search to locate relevant document sections. The model traverses the hierarchy by reasoning about which branches most likely contain the needed information.

JSON Tree Structure

The index uses a recursive node structure:

{
  "node_id": "string",
  "name": "string",
  "description": "string",
  "metadata": {},
  "sub_nodes": []
}

Each node links to raw content including text, images, and tables. This “in-context index” resides within the LLM’s reasoning context during inference, enabling dynamic navigation decisions rather than relying on precomputed similarity scores.
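The node schema above can be sketched in Python. This is a minimal illustration, not the PageIndex API: the `Node` class, the `render_index` helper, and the sample filing are all hypothetical, showing how a recursive tree flattens into a text outline an LLM could read in context.

```python
from dataclasses import dataclass, field

# Minimal sketch of the recursive node schema shown above.
# Field names mirror the JSON keys; everything else is illustrative.
@dataclass
class Node:
    node_id: str
    name: str
    description: str
    metadata: dict = field(default_factory=dict)
    sub_nodes: list = field(default_factory=list)

def render_index(node: Node, depth: int = 0) -> str:
    """Flatten the tree into an indented outline for the LLM's context."""
    line = "  " * depth + f"[{node.node_id}] {node.name}: {node.description}"
    return "\n".join([line] + [render_index(c, depth + 1) for c in node.sub_nodes])

tree = Node("0", "10-K Filing", "Annual report", sub_nodes=[
    Node("0.1", "MD&A", "Management discussion of results"),
    Node("0.2", "Financial Statements", "Balance sheet, income statement", sub_nodes=[
        Node("0.2.1", "Notes", "Debt schedules and commitments"),
    ]),
])

print(render_index(tree))
```

The rendered outline, rather than raw chunks, is what the model reasons over during retrieval.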

Reasoning-Based Retrieval Process

The retrieval algorithm is an iterative loop: starting from the root of the in-context index, the LLM evaluates each node's summary, descends into the branches most likely to contain the answer, and stops once it has located the relevant sections.

This process mirrors how domain experts navigate complex documents. The LLM reasons about document structure rather than matching embeddings. For example, when asked about debt trends in a financial filing, the model might reason: “Debt trends are typically discussed in the financial summary or specific appendices. Let me examine those sections.”

The framework draws inspiration from AlphaGo’s Monte Carlo tree search methodology. Rather than exhaustive search, the LLM performs layer-by-layer exploration, evaluating which branches to expand based on relevance probability. This bounds the search space despite potentially large document trees.
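The layer-by-layer, branch-pruning search described above can be sketched as follows. This is an offline toy, not PageIndex's implementation: `score_relevance` stands in for the LLM call that judges whether a node's summary is likely to contain the answer, using simple keyword overlap so the example runs without an API key.

```python
# Toy relevance judge standing in for an LLM call: fraction of query
# words that appear in a node's description.
def score_relevance(query: str, description: str) -> float:
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def search(node: dict, query: str, threshold: float = 0.2) -> list:
    """Layer-by-layer tree search: expand only branches scoring above
    the threshold, which bounds the search space."""
    hits = []
    for child in node.get("sub_nodes", []):
        if score_relevance(query, child["description"]) >= threshold:
            hits.append(child["node_id"])                 # node worth reading
            hits.extend(search(child, query, threshold))  # expand this branch
    return hits

tree = {"node_id": "0", "description": "Annual report", "sub_nodes": [
    {"node_id": "0.1", "description": "Company history and overview",
     "sub_nodes": []},
    {"node_id": "0.2", "description": "Financial statements and debt summary",
     "sub_nodes": [
        {"node_id": "0.2.1", "description": "Debt schedules and commitments",
         "sub_nodes": []}]}]}

print(search(tree, "debt trends discussion"))  # → ['0.2', '0.2.1']
```

The history branch is pruned at the first layer, while the financial branch is expanded down to the debt schedules, mirroring the "examine those sections" reasoning quoted above.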

Comparison with Vector-Based RAG

Fundamental Limitations of Vector RAG

Traditional vector-based RAG systems exhibit several structural weaknesses that PageIndex addresses:

| Challenge | Vector Approach | PageIndex Solution |
| --- | --- | --- |
| Query-Knowledge Mismatch | Surface-level similarity matching | Reasoned inference about section relevance |
| Similarity vs Relevance | Semantically similar but irrelevant chunks | Contextually appropriate information |
| Hard Chunking | Fixed-length fragments break context | Coherent dynamic section retrieval |
| Conversation History | Isolated queries lack context | Multi-turn context awareness |
| Cross-References | Fails on internal document links | Follows references via index navigation |

The Similarity-Relevance Gap

Vector similarity does not equal relevance. A query about Q3 financial results might retrieve Q2 or Q4 data because different quarters share high semantic similarity in financial reports. Every section discussing revenue, margins, or growth rates produces similar embeddings despite referring to different time periods.

This problem intensifies in professional documents where domain-specific terminology appears throughout. Legal contracts repeat standard clauses. Financial filings use consistent accounting terminology. Technical manuals reference the same component names. Vector systems struggle to distinguish between sections containing similar vocabulary but different information.
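A toy demonstration of this gap, using a bag-of-words cosine in place of a real embedding model (the sentences and the scoring function are illustrative; real embedding models are far subtler, but the failure mode is the same in kind): two statements about different quarters share nearly all of their vocabulary, so they score as close neighbors even though only one answers a Q3 question.

```python
from collections import Counter
from math import sqrt

# Crude "embedding": term-frequency vectors compared by cosine similarity.
def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm

q2 = "Q2 revenue grew 12 percent with improved operating margins"
q3 = "Q3 revenue grew 9 percent with improved operating margins"

# High similarity despite referring to different reporting periods.
print(round(cosine(q2, q3), 2))  # → 0.78
```

A retriever ranking by this score alone cannot tell the quarters apart; only the two period tokens differ.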

Benchmark Performance

FinanceBench Results

Mafin 2.5, powered by PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering:

| System | Accuracy | Full Benchmark Coverage | Public Results |
| --- | --- | --- | --- |
| Mafin 2.5 | 98.7% | Yes (100%) | Yes |
| Fintool | 98% | No (66.7%) | No |
| Quantly | 94% | Yes (100%) | No |
| Vector RAG | ~50% | Yes | Yes |

The 48.7 percentage point improvement over traditional vector RAG demonstrates the magnitude of the accuracy gap in precision-critical domains.

Cross-Model Consistency

The 98.7% accuracy remains consistent across different base models:

  • GPT-4o (public cloud deployment)
  • DeepSeek v3 (privately deployable option)

This consistency suggests the accuracy gains derive from the retrieval methodology rather than model-specific capabilities.

Historical Progression

VectifyAI’s systems show substantial improvement across versions:

  • Mafin 1: 38.0% accuracy
  • Mafin 2.5: 98.7% accuracy

Trade-offs and Limitations

Speed vs Accuracy

PageIndex prioritizes accuracy over latency. The approach requires multiple LLM inference calls per query as the model reasons through the document tree. Vector databases provide sub-second retrieval through precomputed embeddings and approximate nearest neighbor search. PageIndex queries take longer due to sequential reasoning steps.

Cost Implications

Each query incurs LLM inference costs for tree navigation. Vector systems compute embeddings once during indexing and perform inexpensive similarity calculations at query time. For high-volume applications, PageIndex API costs may exceed vector infrastructure costs.
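A back-of-envelope comparison makes the gap concrete. Every figure below is an illustrative assumption, not a measured PageIndex or vector-database price:

```python
# All figures are hypothetical, for illustration only.
VECTOR_QUERY_COST = 0.0001   # USD: embedding + ANN lookup per query
LLM_CALL_COST = 0.01         # USD: one reasoning call over tree context
CALLS_PER_QUERY = 5          # assumed tree-navigation steps per query

def monthly_cost(queries: int, per_query: float) -> float:
    """Total monthly retrieval cost at a given per-query price."""
    return queries * per_query

q = 100_000  # queries per month
print(f"vector RAG:    ${monthly_cost(q, VECTOR_QUERY_COST):,.0f}")
print(f"reasoning RAG: ${monthly_cost(q, LLM_CALL_COST * CALLS_PER_QUERY):,.0f}")
```

Under these assumptions the reasoning-based path costs several hundred times more per query, which is why the approach targets low-volume, high-stakes workloads rather than consumer-scale traffic.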

Scalability Concerns

Hacker News discussion raised valid questions about performance with large document collections:

  • Context window limitations constrain tree size
  • Reasoning overhead increases with document complexity
  • The approach works best for smaller, high-value document sets

The PageIndex team acknowledged these trade-offs, positioning the system for “accuracy over speed” in specialized domains.

When Vector RAG Remains Superior

  • Real-time retrieval requirements with sub-second latency
  • Large-scale document corpora with millions of documents
  • Recommendation-style tasks where semantic similarity approximates relevance
  • Consumer-scale applications prioritizing throughput over precision

When PageIndex Excels

  • Financial analysis requiring exact data from SEC filings
  • Legal research with high cost of incorrect clause retrieval
  • Medical records where patient history accuracy is critical
  • Regulatory compliance requiring traceable retrieval paths
  • Any domain where similarity does not equal relevance

Implementation Details

Installation

From inside a clone of the repository:

pip3 install --upgrade -r requirements.txt

Configuration

Create a .env file with API credentials:

CHATGPT_API_KEY=your_key_here

Basic Usage

python3 run_pageindex.py --pdf_path /path/to/document.pdf
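The indexing run produces a JSON tree following the schema shown earlier, which can then be consumed programmatically. A minimal sketch of walking such a tree; the sample is embedded inline so the snippet is self-contained, since the exact output filename is not specified here:

```python
import json

# Sample index following the node schema from the Technical Architecture
# section; a real run would produce this from the input PDF.
sample = '''{"node_id": "0", "name": "Report", "description": "Root node",
             "metadata": {}, "sub_nodes": [
               {"node_id": "0.1", "name": "Intro", "description": "Overview",
                "metadata": {"pages": "1-3"}, "sub_nodes": []}]}'''

def count_nodes(node: dict) -> int:
    """Walk the tree recursively, e.g. to sanity-check an index."""
    return 1 + sum(count_nodes(c) for c in node["sub_nodes"])

tree = json.loads(sample)
print(count_nodes(tree))  # → 2
```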

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model | gpt-4o-2024-11-20 | OpenAI model for reasoning |
| --max-pages-per-node | 10 | Maximum pages per tree node |
| --max-tokens-per-node | 20000 | Token limit per node |
| --if-add-node-summary | true | Include summaries in nodes |

Markdown files are supported via the --md_path flag.

Deployment Options

  1. Self-hosted: Run locally using the open-source repository
  2. Cloud Service: Access via PageIndex Chat platform or API
  3. MCP Integration: Connect to Claude and similar LLM applications
  4. Enterprise: Private/on-premises deployment available

Community Reception

GitHub Traction

PageIndex reached #4 trending on GitHub with over 1,374 stars, indicating substantial developer interest in alternatives to vector-based retrieval.

Technical Critiques

Hacker News discussion surfaced several substantive concerns:

Not truly vectorless: The approach still relies heavily on LLM calls. One commenter characterized it as “Add structure with recursive LLM API calls, show LLM that structure to search.” Complexity shifts rather than disappears.

Precedent exists: Similar approaches exist in production systems including Codanna MCP and earlier document summarization implementations with table-of-contents generation.

Hybrid future: Multiple discussants predicted convergence toward hybrid approaches: vectors for initial filtering of large corpora, reasoning-based retrieval for final precise extraction.

Key Findings

  • PageIndex achieves 98.7% accuracy on FinanceBench, nearly doubling the ~50% accuracy of traditional vector RAG
  • The approach eliminates vector databases, chunking, and embedding pipelines in favor of hierarchical tree indexing
  • Retrieval becomes a reasoning problem where LLMs actively decide which document sections to examine
  • Trade-offs include higher latency, increased per-query costs, and scalability limitations for large corpora
  • Optimal use cases include precision-critical domains where retrieval errors carry high costs
  • Hybrid architectures combining vector filtering with reasoning-based retrieval represent a likely industry direction

References

  1. PageIndex GitHub Repository - Primary source code and documentation
  2. Sumanth Twitter Announcement - January 19, 2026
  3. PageIndex Introduction Blog - Technical methodology explanation
  4. Mafin 2.5 FinanceBench Repository - Benchmark evaluation details
  5. Hacker News Discussion - Community technical critique
  6. ByteIota Analysis - Third-party evaluation