Computational Complexity Limits on AI Agent Capabilities

Research Date: 2026-01-26
Source URL: https://futurism.com/artificial-intelligence/ai-agents-incapable-math

Summary

A July 2025 paper by Vishal Sikka (former SAP CTO, former Infosys CEO) and Varin Sikka argues that LLMs have an inherent computational ceiling that makes reliable agentic task completion mathematically impossible beyond certain complexity thresholds. The core argument rests on computational complexity theory: since LLM inference operates at O(n²·d) complexity, tasks requiring higher complexity cannot be correctly executed regardless of model scale or training data. This has direct implications for practitioners deploying AI agents, as it defines categories of tasks where agents will systematically fail or produce incorrect results.

The Computational Ceiling

LLM Inference Complexity

Standard transformer-based LLMs execute inference with computational complexity bounded by O(n²·d), where:

  • n = input sequence length (tokens)
  • d = model embedding dimension

This bound derives from the self-attention mechanism, which computes relationships between all token pairs. For a 17-token input on Llama-3.2-3B-Instruct, the model executes approximately 109 billion floating-point operations regardless of task content.
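The quadratic scaling is easy to sanity-check with a back-of-the-envelope FLOPs count. The sketch below counts only self-attention multiply-adds and ignores feed-forward layers and constant factors; the dimensions are illustrative values in the range of a Llama-3.2-3B-class model, not exact architecture figures:

```python
def attention_flops(n: int, d: int, layers: int) -> int:
    """Rough self-attention cost per forward pass: building the n x n
    score matrix and taking the weighted sum over values each cost
    about n^2 * d multiply-adds per layer."""
    return layers * 2 * n * n * d

# Doubling the sequence length quadruples the attention cost,
# independent of what the tokens actually ask the model to do.
base = attention_flops(n=1024, d=3072, layers=28)
doubled = attention_flops(n=2048, d=3072, layers=28)
print(doubled / base)  # 4.0
```

The key point is that this budget is fixed by n and d alone: the model spends the same compute whether the prompt asks for a summary or for an exhaustive search.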

The Hartmanis-Stearns Constraint

The paper applies the time-hierarchy theorem from computational complexity theory: if a task inherently requires T(n) time complexity and T(n) exceeds the available compute budget, the task cannot be correctly completed. Applied to LLMs, this means any task embedded in a prompt requiring complexity greater than O(n²·d) will either fail or produce incorrect output.

The authors term this the “hallucination station” because the model produces output that appears valid but cannot be computationally correct.

Practical Task Categories

Tasks Within Reliable Bounds

Tasks where computational requirements fit within O(n²·d) can theoretically be handled correctly:

  • Text summarization of provided content
  • Classification against learned categories
  • Pattern matching and retrieval from context
  • Simple logical inference chains
  • Format transformation and translation

Tasks Exceeding Complexity Limits

The paper identifies several task categories that exceed LLM computational capacity:

  • Combinatorial search (O(k^n) or O(n!)): listing all possible orderings, subset enumeration
  • Dense matrix operations (O(n³)): matrix multiplication, solving linear systems
  • Verification problems (exponential): proving a path is shortest, validating optimization solutions
  • Model checking (state explosion): exhaustive software verification
  • Scheduling optimization (NP-complete): optimal resource allocation, crew scheduling
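For the combinatorial case, the gap can be made concrete by comparing a single forward pass's compute budget against the work enumeration actually requires. The numbers below are illustrative (a generous context length and embedding dimension), not measurements of any specific model:

```python
import math

def inference_budget(n: int, d: int) -> int:
    # Operations available in one forward pass scale as n^2 * d.
    return n * n * d

# Listing every ordering of just 20 items requires 20! distinct
# outputs, dwarfing the compute available even at generous settings.
budget = inference_budget(n=4096, d=4096)   # ~6.9e10
required = math.factorial(20)               # ~2.4e18
print(required > budget)  # True
```

No amount of scaling closes this gap for long: the budget grows polynomially in n while the task's requirement grows factorially.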

Agent-to-Agent Verification Fails

A significant finding concerns multi-agent architectures where one agent verifies another’s work. The paper demonstrates that if Agent A performs a task with complexity C(n), Agent B cannot reliably verify the result because verification itself often requires C(n) or greater complexity. This undermines architectures relying on LLM-based validation layers.
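A toy verifier illustrates the point (this is an illustration of the complexity argument, not the paper's construction): confirming that a claimed traveling-salesman tour is optimal means comparing it against every alternative tour, so the check inherits the original O(n!) cost.

```python
import itertools

def tour_cost(dist, tour):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def is_optimal_tour(dist, tour):
    """Verifying optimality requires enumerating all n! tours --
    the verifier does as much work as solving the problem itself."""
    claimed = tour_cost(dist, tour)
    return all(tour_cost(dist, list(perm)) >= claimed
               for perm in itertools.permutations(range(len(dist))))

dist = [[0, 1, 9, 4],
        [1, 0, 2, 9],
        [9, 2, 0, 3],
        [4, 9, 3, 0]]
print(is_optimal_tour(dist, [0, 1, 2, 3]))  # True  (cost 10)
print(is_optimal_tour(dist, [0, 1, 3, 2]))  # False (cost 22)
```

An LLM verifier bound by O(n²·d) faces exactly this asymmetry: it cannot perform the factorial-scale check, so its "verification" is at best a plausibility judgment.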

Reasoning Models Do Not Escape the Constraint

The paper addresses whether reasoning models (OpenAI o3, DeepSeek R1) overcome these limits. The authors argue they do not, for two reasons:

  1. Base operation unchanged: Each reasoning step still executes with O(n²·d) complexity
  2. Token budget insufficient: The “think” token allocation remains far smaller than required for genuinely complex computations

Apple’s “Illusion of Thinking” paper (June 2025) provides empirical support. Researchers tested reasoning models on Tower of Hanoi puzzles with controllable complexity and observed “reasoning collapse,” where performance drops to zero beyond certain problem sizes. The models’ reasoning effort increased with complexity up to a threshold, then declined even though token budget remained available.
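The Tower of Hanoi makes a clean testbed because the minimal solution length is known exactly and grows exponentially in the number of disks, so any fixed “think” token budget is eventually overwhelmed (the move counts below are exact; the collapse threshold itself is model-specific):

```python
def hanoi_min_moves(disks: int) -> int:
    # No correct solution can be shorter than 2^n - 1 moves.
    return 2 ** disks - 1

for n in (5, 10, 15, 20):
    print(n, hanoi_min_moves(n))
# 5 -> 31, 10 -> 1023, 15 -> 32767, 20 -> 1048575
```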

Practical Implications for Agent Deployment

What This Means for Builders

Agent Architecture Recommendations

Do not rely on LLM agents for:

  • Exhaustive search or enumeration tasks
  • Mathematical proofs requiring extensive computation
  • Verification of complex algorithmic outputs
  • Optimization problems requiring global search
  • Tasks where correctness requires more computation than the model performs

Use composite architectures where:

  • LLM handles natural language understanding and coordination
  • External tools perform precise computation (calculators, solvers, databases)
  • Verification uses formal methods rather than LLM judgment
  • Human oversight covers high-stakes decisions
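A minimal sketch of that division of labor (hypothetical names throughout; real agent frameworks wire the LLM's tool choice into this loop, but the computation path is the same):

```python
from typing import Callable

# Deterministic tools perform the exact work the LLM cannot guarantee.
TOOLS: dict[str, Callable] = {
    "sort": sorted,
    "sum": sum,
}

def run_step(tool_name: str, payload):
    """Stand-in for one agent step: in a real system an LLM would pick
    tool_name from the user's request; the execution stays exact."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](payload)

print(run_step("sort", [3, 1, 2]))  # [1, 2, 3]
print(run_step("sum", [3, 1, 2]))   # 6
```

The LLM's job reduces to routing and explanation, tasks that plausibly fit inside the O(n²·d) envelope; the precise computation never depends on inference being correct.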

The Guardrail Strategy

Both Sikka and industry respondents agree that composite systems can mitigate these limitations. Harmonic, a startup founded by Robinhood CEO Vlad Tenev, demonstrates one approach: encoding LLM outputs in the Lean programming language for formal mathematical verification. This shifts the verification burden from LLM inference to provably correct external systems.
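For flavor, here is what a machine-checked statement looks like in Lean (a trivial example, not Harmonic's actual pipeline): once the kernel accepts the proof term, correctness no longer depends on any model's judgment.

```lean
-- Lean 4: the proof checker, not an LLM, certifies this statement.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```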

Sikka stated: “Our paper is saying that a pure LLM has this inherent limitation, but at the same time it is true that you can build components around LLMs that overcome those limitations.”

Industry Context

The Gap Between Promise and Delivery

The AI industry positioned 2025 as “the year of AI agents,” but deployment reality fell short. Steven Levy’s Wired analysis notes that 2025 became “the year of talking about AI agents” rather than the year of deploying them at scale. Corporate adoption remains constrained by the risk that hallucinations disrupt workflows.

OpenAI’s Position

In September 2025, OpenAI researchers published findings acknowledging that AI hallucinations “continue to plague the field” and that “accuracy will never reach 100 percent.” When asked to provide the lead author’s dissertation title, three tested models (including ChatGPT) fabricated titles and misreported publication years.

Counterarguments

Industry optimists argue:

  • Hallucinations may enable discovery of novel solutions humans never considered
  • Guardrails can filter incorrect outputs effectively enough for practical use
  • Task-specific fine-tuning can reduce errors in narrow domains
  • Economic incentives will drive solutions

Tudor Achim of Harmonic stated: “I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence. The way that systems learn is by hallucinating something. It’s often wrong, but sometimes it’s something that no human has ever thought before.”

Limitations of the Research

The Sikka paper has not undergone peer review (submitted to AAAI-26). The argument assumes tasks are executed in a single inference pass rather than through iterative refinement. Tool-augmented agents that delegate computation externally are explicitly acknowledged as a valid mitigation strategy.

The practical threshold where tasks begin failing depends on specific model architecture, context length, and task encoding. The paper provides theoretical bounds rather than empirical benchmarks for specific task categories.

Key Findings

  • LLM inference complexity is bounded by O(n²·d), creating a ceiling on reliably executable task complexity
  • Tasks requiring cubic, exponential, or factorial complexity cannot be correctly completed by pure LLM systems
  • Multi-agent verification architectures fail because verification often requires equivalent or greater complexity than the original task
  • Reasoning models exhibit “reasoning collapse” at high complexity, contradicting claims that extended thinking overcomes fundamental limits
  • Composite architectures delegating precise computation to external systems remain the viable path for reliable agent deployment
  • Human verification layers are necessary for high-stakes agentic outputs regardless of architectural sophistication

References

  1. Sikka, V., & Sikka, V. (2025). Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models. arXiv:2507.07505. https://arxiv.org/abs/2507.07505
  2. Shojaee, P., et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research. https://machinelearning.apple.com/research/illusion-of-thinking
  3. Levy, S. (2026, January 23). The Math on AI Agents Doesn’t Add Up. Wired. https://www.wired.com/story/ai-agents-math-doesnt-add-up/
  4. Landymore, F. (2026, January 26). AI Agents Are Mathematically Incapable of Doing Functional Work, Paper Finds. Futurism. https://futurism.com/artificial-intelligence/ai-agents-incapable-math
  5. Hartmanis, J., & Stearns, R. E. (1965). On the computational complexity of algorithms. Transactions of the American Mathematical Society, 117, 285-306.