PageIndex: Vectorless Reasoning-Based RAG Framework
Research Date: 2026-01-28
Source URL: https://x.com/sumanth_077/status/2013232922296561826
Reference URLs
- PageIndex GitHub Repository
- PageIndex Official Website
- PageIndex Introduction Blog
- PageIndex Documentation
- Mafin 2.5 FinanceBench Evaluation
- PageIndex MCP Server
- Hacker News Discussion
Summary
PageIndex is an open-source RAG framework developed by VectifyAI that eliminates vector databases and document chunking from retrieval systems. The framework constructs hierarchical tree structures from documents and uses LLM-powered reasoning to navigate these structures, mimicking how human experts read complex documents. This approach achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering, compared to approximately 50% accuracy from traditional vector-based RAG systems.
The system represents a fundamental architectural shift: rather than computing semantic similarity between query embeddings and document chunk embeddings, PageIndex treats retrieval as a reasoning problem where the LLM actively decides which document sections to examine based on logical inference about document structure.
Technical Architecture
Two-Phase Operation
PageIndex operates through two distinct phases:
Phase 1 - Indexing: Documents undergo parsing and transformation into a hierarchical JSON-based tree structure. This structure resembles a table of contents but includes summaries, metadata, and page references at each node. The tree preserves natural document organization rather than imposing arbitrary chunking boundaries.
Phase 2 - Retrieval: During query time, the LLM performs reasoning-based tree search to locate relevant document sections. The model traverses the hierarchy by reasoning about which branches most likely contain the needed information.
JSON Tree Structure
The index uses a recursive node structure:
{
  "node_id": "string",
  "name": "string",
  "description": "string",
  "metadata": {},
  "sub_nodes": []
}
Each node links to raw content including text, images, and tables. This “in-context index” resides within the LLM’s reasoning context during inference, enabling dynamic navigation decisions rather than relying on precomputed similarity scores.
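As a concrete sketch, such a tree can be flattened into the compact text index the LLM navigates. The field names follow the JSON schema above; the outline rendering itself is an illustrative assumption, not PageIndex's actual prompt format.

```python
# Sketch: flatten a recursive node tree into an indented outline that
# could serve as an "in-context index". Field names follow the JSON
# schema above; the rendering format itself is an assumption.

def render_index(node: dict, depth: int = 0) -> str:
    """Render a node and its sub_nodes as an indented outline."""
    line = f"{'  ' * depth}[{node['node_id']}] {node['name']}: {node['description']}"
    lines = [line]
    for child in node.get("sub_nodes", []):
        lines.append(render_index(child, depth + 1))
    return "\n".join(lines)

tree = {
    "node_id": "0", "name": "10-K Filing",
    "description": "Annual report", "metadata": {},
    "sub_nodes": [
        {"node_id": "0.1", "name": "Financial Summary",
         "description": "Revenue, margins, and debt", "metadata": {}, "sub_nodes": []},
        {"node_id": "0.2", "name": "Risk Factors",
         "description": "Market and credit risk", "metadata": {}, "sub_nodes": []},
    ],
}

print(render_index(tree))
```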
Reasoning-Based Retrieval Process
The retrieval algorithm follows an iterative loop: the model inspects the in-context index, reasons about which nodes are most likely relevant, reads the content linked from those nodes, judges whether it has enough information to answer, and repeats the traversal if not.
This process mirrors how domain experts navigate complex documents. The LLM reasons about document structure rather than matching embeddings. For example, when asked about debt trends in a financial filing, the model might reason: “Debt trends are typically discussed in the financial summary or specific appendices. Let me examine those sections.”
AlphaGo-Inspired Tree Search
The framework draws inspiration from AlphaGo’s Monte Carlo tree search methodology. Rather than exhaustive search, the LLM performs layer-by-layer exploration, evaluating which branches to expand based on relevance probability. This bounds the search space despite potentially large document trees.
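A minimal sketch of such bounded, layer-by-layer exploration follows; the scoring function is a keyword stub for an LLM relevance judgment, and the beam width and tree shape are illustrative.

```python
# Sketch of beam-limited tree search: at each layer only the k most
# promising nodes are kept, so the number of nodes examined grows with
# depth and beam width rather than with total tree size. The score
# function is a keyword stub for an LLM relevance judgment.

def score(query: str, node: dict) -> int:
    return sum(word in node["description"].lower() for word in query.lower().split())

def beam_search(query: str, root: dict, k: int = 2):
    frontier, visited, leaves = [root], 0, []
    while frontier:
        children = []
        for node in frontier:
            visited += 1
            if node.get("sub_nodes"):
                children.extend(node["sub_nodes"])
            else:
                leaves.append(node)
        children.sort(key=lambda n: score(query, n), reverse=True)
        frontier = children[:k]  # expand only the k best branches
    return leaves, visited

# A synthetic 21-node tree: 4 sections, each with 4 subsections.
tree = {"description": "all topics", "sub_nodes": [
    {"description": f"topic {i}", "sub_nodes": [
        {"description": f"topic {i} detail {j}", "sub_nodes": []} for j in range(4)
    ]} for i in range(4)
]}

leaves, visited = beam_search("topic 2", tree, k=2)
print(visited)  # far fewer nodes examined than the 21 in the tree
```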
Comparison with Vector-Based RAG
Fundamental Limitations of Vector RAG
Traditional vector-based RAG systems exhibit several structural weaknesses that PageIndex addresses:
| Challenge | Vector Approach | PageIndex Solution |
|---|---|---|
| Query-Knowledge Mismatch | Surface-level similarity matching | Reasoned inference about section relevance |
| Similarity vs Relevance | Semantically similar but irrelevant chunks | Contextually appropriate information |
| Hard Chunking | Fixed-length fragments break context | Coherent dynamic section retrieval |
| Conversation History | Isolated queries lack context | Multi-turn context awareness |
| Cross-References | Fails on internal document links | Follows references via index navigation |
The Similarity-Relevance Gap
Vector similarity does not equal relevance. A query about Q3 financial results might retrieve Q2 or Q4 data because different quarters share high semantic similarity in financial reports. Every section discussing revenue, margins, or growth rates produces similar embeddings despite referring to different time periods.
This problem intensifies in professional documents where domain-specific terminology appears throughout. Legal contracts repeat standard clauses. Financial filings use consistent accounting terminology. Technical manuals reference the same component names. Vector systems struggle to distinguish between sections containing similar vocabulary but different information.
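A toy illustration of this gap, using bag-of-words cosine similarity as a crude stand-in for dense embeddings (the sentences are fabricated for illustration):

```python
# Two sentences about different quarters share almost all vocabulary,
# so a bag-of-words cosine (a crude stand-in for an embedding model)
# rates them as near-duplicates even though only one answers a Q3 query.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

q2 = "Q2 total revenue was 4.1 billion dollars up 4 percent year over year with gross margin at 41 percent"
q3 = "Q3 total revenue was 4.4 billion dollars up 6 percent year over year with gross margin at 43 percent"

print(f"similarity: {cosine(q2, q3):.2f}")  # high, despite different quarters
```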
Benchmark Performance
FinanceBench Results
Mafin 2.5, powered by PageIndex, achieved 98.7% accuracy on the FinanceBench benchmark for financial document question-answering:
| System | Accuracy | Full Benchmark Coverage | Public Results |
|---|---|---|---|
| Mafin 2.5 | 98.7% | Yes (100%) | Yes |
| Fintool | 98% | No (66.7%) | No |
| Quantly | 94% | Yes (100%) | No |
| Vector RAG | ~50% | Yes | Yes |
The 48.7 percentage point improvement over traditional vector RAG demonstrates the magnitude of the accuracy gap in precision-critical domains.
Cross-Model Consistency
The 98.7% accuracy remains consistent across different base models:
- GPT-4o (public cloud deployment)
- DeepSeek v3 (privately deployable option)
This consistency suggests the accuracy gains derive from the retrieval methodology rather than model-specific capabilities.
Historical Progression
VectifyAI’s systems show substantial improvement across versions:
- Mafin 1: 38.0% accuracy
- Mafin 2.5: 98.7% accuracy
Trade-offs and Limitations
Speed vs Accuracy
PageIndex prioritizes accuracy over latency. The approach requires multiple LLM inference calls per query as the model reasons through the document tree. Vector databases provide sub-second retrieval through precomputed embeddings and approximate nearest neighbor search. PageIndex queries take longer due to sequential reasoning steps.
Cost Implications
Each query incurs LLM inference costs for tree navigation. Vector systems compute embeddings once during indexing and perform inexpensive similarity calculations at query time. For high-volume applications, PageIndex API costs may exceed vector infrastructure costs.
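The asymmetry can be made concrete with back-of-envelope arithmetic; every price and token count below is an assumption invented for the sketch, not a quoted rate or measurement.

```python
# Illustrative per-query cost arithmetic. Every number here is an
# assumption made up for the sketch, not an actual price or measurement.

LLM_USD_PER_1K_TOKENS = 0.005     # assumed blended LLM inference price
EMBED_USD_PER_1K_TOKENS = 0.0001  # assumed embedding price

# Reasoning-based query: several LLM calls over the in-context index.
reasoning_calls = 4
tokens_per_call = 3_000
reasoning_cost = reasoning_calls * tokens_per_call / 1_000 * LLM_USD_PER_1K_TOKENS

# Vector query: embed ~50 query tokens, then a cheap ANN lookup.
vector_cost = 50 / 1_000 * EMBED_USD_PER_1K_TOKENS

print(f"reasoning query ~ ${reasoning_cost:.4f}, vector query ~ ${vector_cost:.6f}")
```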
Scalability Concerns
Hacker News discussion raised valid questions about performance with large document collections:
- Context window limitations constrain tree size
- Reasoning overhead increases with document complexity
- The approach works best for smaller, high-value document sets
The PageIndex team acknowledged these trade-offs, positioning the system for “accuracy over speed” in specialized domains.
When Vector RAG Remains Superior
- Real-time retrieval requirements with sub-second latency
- Large-scale document corpora with millions of documents
- Recommendation-style tasks where semantic similarity approximates relevance
- Consumer-scale applications prioritizing throughput over precision
When PageIndex Excels
- Financial analysis requiring exact data from SEC filings
- Legal research with high cost of incorrect clause retrieval
- Medical records where patient history accuracy is critical
- Regulatory compliance requiring traceable retrieval paths
- Any domain where similarity does not equal relevance
Implementation Details
Installation
pip3 install --upgrade -r requirements.txt
Configuration
Create a .env file with API credentials:
CHATGPT_API_KEY=your_key_here
Basic Usage
python3 run_pageindex.py --pdf_path /path/to/document.pdf
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| --model | gpt-4o-2024-11-20 | OpenAI model for reasoning |
| --max-pages-per-node | 10 | Maximum pages per tree node |
| --max-tokens-per-node | 20000 | Token limit per node |
| --if-add-node-summary | true | Include summaries in nodes |
Markdown files are supported via the --md_path flag.
Deployment Options
- Self-hosted: Run locally using the open-source repository
- Cloud Service: Access via PageIndex Chat platform or API
- MCP Integration: Connect to Claude and similar LLM applications
- Enterprise: Private/on-premises deployment available
Community Reception
GitHub Traction
PageIndex reached #4 trending on GitHub with 1,374 stars, indicating substantial developer interest in alternatives to vector-based retrieval.
Technical Critiques
Hacker News discussion surfaced several substantive concerns:
Not truly vectorless: The approach still relies heavily on LLM calls. One commenter characterized it as “Add structure with recursive LLM API calls, show LLM that structure to search.” Complexity shifts rather than disappears.
Precedent exists: Similar approaches exist in production systems including Codanna MCP and earlier document summarization implementations with table-of-contents generation.
Hybrid future: Multiple discussants predicted convergence toward hybrid approaches: vectors for initial filtering of large corpora, reasoning-based retrieval for final precise extraction.
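That hybrid shape can be sketched as a two-stage pipeline; both scoring functions below are keyword stubs standing in for, respectively, ANN search over embeddings and LLM tree reasoning.

```python
# Hybrid retrieval sketch: a cheap coarse filter narrows a large corpus,
# then an expensive fine-grained step picks the final section. Both
# scorers are keyword stubs for illustration only.

def lexical_score(query: str, text: str) -> int:
    return sum(word in text.lower() for word in query.lower().split())

def hybrid_retrieve(query: str, corpus: list, k: int = 3) -> dict:
    # Stage 1: coarse filter, standing in for vector/ANN search.
    candidates = sorted(corpus, key=lambda d: lexical_score(query, d["summary"]),
                        reverse=True)[:k]
    # Stage 2: precise selection, standing in for reasoning-based retrieval.
    return max(candidates, key=lambda d: lexical_score(query, d["body"]))

corpus = [
    {"id": "a", "summary": "q1 results", "body": "q1 revenue details"},
    {"id": "b", "summary": "q3 results", "body": "q3 revenue details"},
    {"id": "c", "summary": "hiring update", "body": "headcount figures"},
]

print(hybrid_retrieve("q3 revenue", corpus)["id"])
```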
Key Findings
- PageIndex achieves 98.7% accuracy on FinanceBench, nearly doubling the ~50% accuracy of traditional vector RAG
- The approach eliminates vector databases, chunking, and embedding pipelines in favor of hierarchical tree indexing
- Retrieval becomes a reasoning problem where LLMs actively decide which document sections to examine
- Trade-offs include higher latency, increased per-query costs, and scalability limitations for large corpora
- Optimal use cases include precision-critical domains where retrieval errors carry high costs
- Hybrid architectures combining vector filtering with reasoning-based retrieval represent a likely industry direction
References
- PageIndex GitHub Repository - Primary source code and documentation
- Sumanth Twitter Announcement - January 19, 2026
- PageIndex Introduction Blog - Technical methodology explanation
- Mafin 2.5 FinanceBench Repository - Benchmark evaluation details
- Hacker News Discussion - Community technical critique
- ByteIota Analysis - Third-party evaluation