PageIndex: Vectorless Reasoning-Based RAG Framework

Research Date: 2026-01-28 Source URL: https://x.com/sumanth_077/status/2013232922296561826

Summary

PageIndex is an open-source RAG framework developed by VectifyAI that eliminates vector databases and document chunking from retrieval systems. The framework constructs hierarchical tree structures from documents and uses LLM-powered reasoning to navigate these structures, mimicking how human experts read complex documents. This approach achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering, compared to approximately 50% accuracy from traditional vector-based RAG systems.

The system represents a fundamental architectural shift: rather than computing semantic similarity between query embeddings and document chunk embeddings, PageIndex treats retrieval as a reasoning problem where the LLM actively decides which document sections to examine based on logical inference about document structure.

Technical Architecture

Two-Phase Operation

PageIndex operates through two distinct phases:

Phase 1 - Indexing: Documents undergo parsing and transformation into a hierarchical JSON-based tree structure. This structure resembles a table of contents but includes summaries, metadata, and page references at each node. The tree preserves natural document organization rather than imposing arbitrary chunking boundaries.

Phase 2 - Retrieval: During query time, the LLM performs reasoning-based tree search to locate relevant document sections. The model traverses the hierarchy by reasoning about which branches most likely contain the needed information.

JSON Tree Structure

The index uses a recursive node structure:

{
  "node_id": "string",
  "name": "string",
  "description": "string",
  "metadata": {},
  "sub_nodes": []
}

Each node links to raw content including text, images, and tables. This “in-context index” resides within the LLM’s reasoning context during inference, enabling dynamic navigation decisions rather than relying on precomputed similarity scores.
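The node schema above can be sketched in Python. This is a minimal illustration, not the PageIndex API: the `Node` class, the `render_index` helper, and the sample filing are all hypothetical, showing how a recursive tree flattens into a text outline an LLM could read in context.

```python
from dataclasses import dataclass, field

# Minimal sketch of the recursive node schema shown above.
# Field names mirror the JSON keys; everything else is illustrative.
@dataclass
class Node:
    node_id: str
    name: str
    description: str
    metadata: dict = field(default_factory=dict)
    sub_nodes: list = field(default_factory=list)

def render_index(node: Node, depth: int = 0) -> str:
    """Flatten the tree into an indented outline for the LLM's context."""
    line = "  " * depth + f"[{node.node_id}] {node.name}: {node.description}"
    return "\n".join([line] + [render_index(c, depth + 1) for c in node.sub_nodes])

tree = Node("0", "10-K Filing", "Annual report", sub_nodes=[
    Node("0.1", "MD&A", "Management discussion of results"),
    Node("0.2", "Financial Statements", "Balance sheet, income statement", sub_nodes=[
        Node("0.2.1", "Notes", "Debt schedules and commitments"),
    ]),
])

print(render_index(tree))
```

The rendered outline, rather than raw chunks, is what the model reasons over during retrieval.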

Reasoning-Based Retrieval Process

The retrieval algorithm is an iterative loop: starting from the root of the in-context index, the LLM evaluates each node's summary, descends into the branches most likely to contain the answer, and stops once it has located the relevant sections.

This process mirrors how domain experts navigate complex documents. The LLM reasons about document structure rather than matching embeddings. For example, when asked about debt trends in a financial filing, the model might reason: “Debt trends are typically discussed in the financial summary or specific appendices. Let me examine those sections.”

The framework draws inspiration from AlphaGo’s Monte Carlo tree search methodology. Rather than exhaustive search, the LLM performs layer-by-layer exploration, evaluating which branches to expand based on relevance probability. This bounds the search space despite potentially large document trees.
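The layer-by-layer, branch-pruning search described above can be sketched as follows. This is an offline toy, not PageIndex's implementation: `score_relevance` stands in for the LLM call that judges whether a node's summary is likely to contain the answer, using simple keyword overlap so the example runs without an API key.

```python
# Toy relevance judge standing in for an LLM call: fraction of query
# words that appear in a node's description.
def score_relevance(query: str, description: str) -> float:
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def search(node: dict, query: str, threshold: float = 0.2) -> list:
    """Layer-by-layer tree search: expand only branches scoring above
    the threshold, which bounds the search space."""
    hits = []
    for child in node.get("sub_nodes", []):
        if score_relevance(query, child["description"]) >= threshold:
            hits.append(child["node_id"])                 # node worth reading
            hits.extend(search(child, query, threshold))  # expand this branch
    return hits

tree = {"node_id": "0", "description": "Annual report", "sub_nodes": [
    {"node_id": "0.1", "description": "Company history and overview",
     "sub_nodes": []},
    {"node_id": "0.2", "description": "Financial statements and debt summary",
     "sub_nodes": [
        {"node_id": "0.2.1", "description": "Debt schedules and commitments",
         "sub_nodes": []}]}]}

print(search(tree, "debt trends discussion"))  # → ['0.2', '0.2.1']
```

The history branch is pruned at the first layer, while the financial branch is expanded down to the debt schedules, mirroring the "examine those sections" reasoning quoted above.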

Comparison with Vector-Based RAG

Fundamental Limitations of Vector RAG

Traditional vector-based RAG systems exhibit several structural weaknesses that PageIndex addresses:

| Challenge | Vector Approach | PageIndex Solution |
| --- | --- | --- |
| Query-Knowledge Mismatch | Surface-level similarity matching | Reasoned inference about section relevance |
| Similarity vs Relevance | Semantically similar but irrelevant chunks | Contextually appropriate information |
| Hard Chunking | Fixed-length fragments break context | Coherent dynamic section retrieval |
| Conversation History | Isolated queries lack context | Multi-turn context awareness |
| Cross-References | Fails on internal document links | Follows references via index navigation |

The Similarity-Relevance Gap

Vector similarity does not equal relevance. A query about Q3 financial results might retrieve Q2 or Q4 data because different quarters share high semantic similarity in financial reports. Every section discussing revenue, margins, or growth rates produces similar embeddings despite referring to different time periods.

This problem intensifies in professional documents where domain-specific terminology appears throughout. Legal contracts repeat standard clauses. Financial filings use consistent accounting terminology. Technical manuals reference the same component names. Vector systems struggle to distinguish between sections containing similar vocabulary but different information.
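A toy demonstration of this gap, using a bag-of-words cosine in place of a real embedding model (the sentences and the scoring function are illustrative; real embedding models are far subtler, but the failure mode is the same in kind): two statements about different quarters share nearly all of their vocabulary, so they score as close neighbors even though only one answers a Q3 question.

```python
from collections import Counter
from math import sqrt

# Crude "embedding": term-frequency vectors compared by cosine similarity.
def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm

q2 = "Q2 revenue grew 12 percent with improved operating margins"
q3 = "Q3 revenue grew 9 percent with improved operating margins"

# High similarity despite referring to different reporting periods.
print(round(cosine(q2, q3), 2))  # → 0.78
```

A retriever ranking by this score alone cannot tell the quarters apart; only the two period tokens differ.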

Benchmark Performance

FinanceBench Results

Mafin 2.5, powered by PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering:

| System | Accuracy | Full Benchmark Coverage | Public Results |
| --- | --- | --- | --- |
| Mafin 2.5 | 98.7% | Yes (100%) | Yes |
| Fintool | 98% | No (66.7%) | No |
| Quantly | 94% | Yes (100%) | No |
| Vector RAG | ~50% | Yes | Yes |

The 48.7 percentage point improvement over traditional vector RAG demonstrates the magnitude of the accuracy gap in precision-critical domains.

Cross-Model Consistency

The 98.7% accuracy remains consistent across different base models:

  • GPT-4o (public cloud deployment)
  • DeepSeek v3 (privately deployable option)

This consistency suggests the accuracy gains derive from the retrieval methodology rather than model-specific capabilities.

Historical Progression

VectifyAI’s systems show substantial improvement across versions:

  • Mafin 1: 38.0% accuracy
  • Mafin 2.5: 98.7% accuracy

Trade-offs and Limitations

Speed vs Accuracy

PageIndex prioritizes accuracy over latency. The approach requires multiple LLM inference calls per query as the model reasons through the document tree. Vector databases provide sub-second retrieval through precomputed embeddings and approximate nearest neighbor search. PageIndex queries take longer due to sequential reasoning steps.

Cost Implications

Each query incurs LLM inference costs for tree navigation. Vector systems compute embeddings once during indexing and perform inexpensive similarity calculations at query time. For high-volume applications, PageIndex API costs may exceed vector infrastructure costs.
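A back-of-envelope comparison makes the gap concrete. Every figure below is an illustrative assumption, not a measured PageIndex or vector-database price:

```python
# All figures are hypothetical, for illustration only.
VECTOR_QUERY_COST = 0.0001   # USD: embedding + ANN lookup per query
LLM_CALL_COST = 0.01         # USD: one reasoning call over tree context
CALLS_PER_QUERY = 5          # assumed tree-navigation steps per query

def monthly_cost(queries: int, per_query: float) -> float:
    """Total monthly retrieval cost at a given per-query price."""
    return queries * per_query

q = 100_000  # queries per month
print(f"vector RAG:    ${monthly_cost(q, VECTOR_QUERY_COST):,.0f}")
print(f"reasoning RAG: ${monthly_cost(q, LLM_CALL_COST * CALLS_PER_QUERY):,.0f}")
```

Under these assumptions the reasoning-based path costs several hundred times more per query, which is why the approach targets low-volume, high-stakes workloads rather than consumer-scale traffic.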

Scalability Concerns

Hacker News discussion raised valid questions about performance with large document collections:

  • Context window limitations constrain tree size
  • Reasoning overhead increases with document complexity
  • The approach works best for smaller, high-value document sets

The PageIndex team acknowledged these trade-offs, positioning the system for “accuracy over speed” in specialized domains.

When Vector RAG Remains Superior

  • Real-time retrieval requirements with sub-second latency
  • Large-scale document corpora with millions of documents
  • Recommendation-style tasks where semantic similarity approximates relevance
  • Consumer-scale applications prioritizing throughput over precision

When PageIndex Excels

  • Financial analysis requiring exact data from SEC filings
  • Legal research with high cost of incorrect clause retrieval
  • Medical records where patient history accuracy is critical
  • Regulatory compliance requiring traceable retrieval paths
  • Any domain where similarity does not equal relevance

Implementation Details

Installation

From inside a clone of the repository:

pip3 install --upgrade -r requirements.txt

Configuration

Create a .env file with API credentials:

CHATGPT_API_KEY=your_key_here

Basic Usage

python3 run_pageindex.py --pdf_path /path/to/document.pdf
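The indexing run produces a JSON tree following the schema shown earlier, which can then be consumed programmatically. A minimal sketch of walking such a tree; the sample is embedded inline so the snippet is self-contained, since the exact output filename is not specified here:

```python
import json

# Sample index following the node schema from the Technical Architecture
# section; a real run would produce this from the input PDF.
sample = '''{"node_id": "0", "name": "Report", "description": "Root node",
             "metadata": {}, "sub_nodes": [
               {"node_id": "0.1", "name": "Intro", "description": "Overview",
                "metadata": {"pages": "1-3"}, "sub_nodes": []}]}'''

def count_nodes(node: dict) -> int:
    """Walk the tree recursively, e.g. to sanity-check an index."""
    return 1 + sum(count_nodes(c) for c in node["sub_nodes"])

tree = json.loads(sample)
print(count_nodes(tree))  # → 2
```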

Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --model | gpt-4o-2024-11-20 | OpenAI model for reasoning |
| --max-pages-per-node | 10 | Maximum pages per tree node |
| --max-tokens-per-node | 20000 | Token limit per node |
| --if-add-node-summary | true | Include summaries in nodes |

Markdown files are supported via the --md_path flag.

Deployment Options

  1. Self-hosted: Run locally using the open-source repository
  2. Cloud Service: Access via PageIndex Chat platform or API
  3. MCP Integration: Connect to Claude and similar LLM applications
  4. Enterprise: Private/on-premises deployment available

Community Reception

GitHub Traction

PageIndex reached #4 trending on GitHub with over 1,374 stars, indicating substantial developer interest in alternatives to vector-based retrieval.

Technical Critiques

Hacker News discussion surfaced several substantive concerns:

Not truly vectorless: The approach still relies heavily on LLM calls. One commenter characterized it as “Add structure with recursive LLM API calls, show LLM that structure to search.” Complexity shifts rather than disappears.

Precedent exists: Similar approaches exist in production systems including Codanna MCP and earlier document summarization implementations with table-of-contents generation.

Hybrid future: Multiple discussants predicted convergence toward hybrid approaches: vectors for initial filtering of large corpora, reasoning-based retrieval for final precise extraction.

Key Findings

  • PageIndex achieves 98.7% accuracy on FinanceBench, nearly doubling the ~50% accuracy of traditional vector RAG
  • The approach eliminates vector databases, chunking, and embedding pipelines in favor of hierarchical tree indexing
  • Retrieval becomes a reasoning problem where LLMs actively decide which document sections to examine
  • Trade-offs include higher latency, increased per-query costs, and scalability limitations for large corpora
  • Optimal use cases include precision-critical domains where retrieval errors carry high costs
  • Hybrid architectures combining vector filtering with reasoning-based retrieval represent a likely industry direction

References

  1. PageIndex GitHub Repository - Primary source code and documentation
  2. Sumanth Twitter Announcement - January 19, 2026
  3. PageIndex Introduction Blog - Technical methodology explanation
  4. Mafin 2.5 FinanceBench Repository - Benchmark evaluation details
  5. Hacker News Discussion - Community technical critique
  6. ByteIota Analysis - Third-party evaluation