Computational Complexity Limits on AI Agent Capabilities

Research Date: 2026-01-26
Source URL: https://futurism.com/artificial-intelligence/ai-agents-incapable-math

Summary

A July 2025 paper by Vishal Sikka (former SAP CTO, former Infosys CEO) and Varin Sikka argues that LLMs have an inherent computational ceiling that makes reliable agentic task completion mathematically impossible beyond certain complexity thresholds. The core argument rests on computational complexity theory: since LLM inference operates at O(n²·d) complexity, tasks requiring higher complexity cannot be correctly executed regardless of model scale or training data. This has direct implications for practitioners deploying AI agents, as it defines categories of tasks where agents will systematically fail or produce incorrect results.

The Computational Ceiling

LLM Inference Complexity

Standard transformer-based LLMs execute inference with computational complexity bounded by O(n²·d), where:

  • n = input sequence length (tokens)
  • d = model embedding dimension

This bound derives from the self-attention mechanism, which computes relationships between all token pairs. For a 17-token input on Llama-3.2-3B-Instruct, the model executes approximately 109 billion floating-point operations regardless of task content.
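The quadratic scaling is easy to sanity-check with a back-of-the-envelope FLOPs count. The sketch below counts only self-attention multiply-adds and ignores feed-forward layers and constant factors; the dimensions are illustrative values in the range of a Llama-3.2-3B-class model, not exact architecture figures:

```python
def attention_flops(n: int, d: int, layers: int) -> int:
    """Rough self-attention cost per forward pass: building the n x n
    score matrix and taking the weighted sum over values each cost
    about n^2 * d multiply-adds per layer."""
    return layers * 2 * n * n * d

# Doubling the sequence length quadruples the attention cost,
# independent of what the tokens actually ask the model to do.
base = attention_flops(n=1024, d=3072, layers=28)
doubled = attention_flops(n=2048, d=3072, layers=28)
print(doubled / base)  # 4.0
```

The key point is that this budget is fixed by n and d alone: the model spends the same compute whether the prompt asks for a summary or for an exhaustive search.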

The Hartmanis-Stearns Constraint

The paper applies the time-hierarchy theorem from computational complexity theory: if a task inherently requires T(n) time complexity and T(n) exceeds the available compute budget, the task cannot be correctly completed. Applied to LLMs, this means any task embedded in a prompt requiring complexity greater than O(n²·d) will either fail or produce incorrect output.

The authors term this the “hallucination station” because the model produces output that appears valid but cannot be computationally correct.

Practical Task Categories

Tasks Within Reliable Bounds

Tasks where computational requirements fit within O(n²·d) can theoretically be handled correctly:

  • Text summarization of provided content
  • Classification against learned categories
  • Pattern matching and retrieval from context
  • Simple logical inference chains
  • Format transformation and translation

Tasks Exceeding Complexity Limits

The paper identifies several task categories that exceed LLM computational capacity:

  • Combinatorial search (O(k^n) or O(n!)): listing all possible orderings, subset enumeration
  • Dense matrix operations (O(n³)): matrix multiplication, solving linear systems
  • Verification problems (exponential): proving a path is shortest, validating optimization solutions
  • Model checking (state explosion): exhaustive software verification
  • Scheduling optimization (NP-complete): optimal resource allocation, crew scheduling
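For the combinatorial case, the gap can be made concrete by comparing a single forward pass's compute budget against the work enumeration actually requires. The numbers below are illustrative (a generous context length and embedding dimension), not measurements of any specific model:

```python
import math

def inference_budget(n: int, d: int) -> int:
    # Operations available in one forward pass scale as n^2 * d.
    return n * n * d

# Listing every ordering of just 20 items requires 20! distinct
# outputs, dwarfing the compute available even at generous settings.
budget = inference_budget(n=4096, d=4096)   # ~6.9e10
required = math.factorial(20)               # ~2.4e18
print(required > budget)  # True
```

No amount of scaling closes this gap for long: the budget grows polynomially in n while the task's requirement grows factorially.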

Agent-to-Agent Verification Fails

A significant finding concerns multi-agent architectures where one agent verifies another’s work. The paper demonstrates that if Agent A performs a task with complexity C(n), Agent B cannot reliably verify the result because verification itself often requires C(n) or greater complexity. This undermines architectures relying on LLM-based validation layers.
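A toy verifier illustrates the point (this is an illustration of the complexity argument, not the paper's construction): confirming that a claimed traveling-salesman tour is optimal means comparing it against every alternative tour, so the check inherits the original O(n!) cost.

```python
import itertools

def tour_cost(dist, tour):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def is_optimal_tour(dist, tour):
    """Verifying optimality requires enumerating all n! tours --
    the verifier does as much work as solving the problem itself."""
    claimed = tour_cost(dist, tour)
    return all(tour_cost(dist, list(perm)) >= claimed
               for perm in itertools.permutations(range(len(dist))))

dist = [[0, 1, 9, 4],
        [1, 0, 2, 9],
        [9, 2, 0, 3],
        [4, 9, 3, 0]]
print(is_optimal_tour(dist, [0, 1, 2, 3]))  # True  (cost 10)
print(is_optimal_tour(dist, [0, 1, 3, 2]))  # False (cost 22)
```

An LLM verifier bound by O(n²·d) faces exactly this asymmetry: it cannot perform the factorial-scale check, so its "verification" is at best a plausibility judgment.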

Reasoning Models Do Not Escape the Constraint

The paper addresses whether reasoning models (OpenAI o3, DeepSeek R1) overcome these limits. The authors argue they do not, for two reasons:

  1. Base operation unchanged: Each reasoning step still executes with O(n²·d) complexity
  2. Token budget insufficient: The “think” token allocation remains far smaller than required for genuinely complex computations

Apple’s “Illusion of Thinking” paper (June 2025) provides empirical support. Researchers tested reasoning models on Tower of Hanoi puzzles with controllable complexity and observed “reasoning collapse,” where performance drops to zero beyond certain problem sizes. The models’ reasoning effort increased with complexity up to a threshold, then declined even though token budget remained available.
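The Tower of Hanoi makes a clean testbed because the minimal solution length is known exactly and grows exponentially in the number of disks, so any fixed “think” token budget is eventually overwhelmed (the move counts below are exact; the collapse threshold itself is model-specific):

```python
def hanoi_min_moves(disks: int) -> int:
    # No correct solution can be shorter than 2^n - 1 moves.
    return 2 ** disks - 1

for n in (5, 10, 15, 20):
    print(n, hanoi_min_moves(n))
# 5 -> 31, 10 -> 1023, 15 -> 32767, 20 -> 1048575
```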

Practical Implications for Agent Deployment

What This Means for Builders

Agent Architecture Recommendations

Do not rely on LLM agents for:

  • Exhaustive search or enumeration tasks
  • Mathematical proofs requiring extensive computation
  • Verification of complex algorithmic outputs
  • Optimization problems requiring global search
  • Tasks where correctness requires more computation than the model performs

Use composite architectures where:

  • LLM handles natural language understanding and coordination
  • External tools perform precise computation (calculators, solvers, databases)
  • Verification uses formal methods rather than LLM judgment
  • Human oversight covers high-stakes decisions
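A minimal sketch of that division of labor (hypothetical names throughout; real agent frameworks wire the LLM's tool choice into this loop, but the computation path is the same):

```python
from typing import Callable

# Deterministic tools perform the exact work the LLM cannot guarantee.
TOOLS: dict[str, Callable] = {
    "sort": sorted,
    "sum": sum,
}

def run_step(tool_name: str, payload):
    """Stand-in for one agent step: in a real system an LLM would pick
    tool_name from the user's request; the execution stays exact."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](payload)

print(run_step("sort", [3, 1, 2]))  # [1, 2, 3]
print(run_step("sum", [3, 1, 2]))   # 6
```

The LLM's job reduces to routing and explanation, tasks that plausibly fit inside the O(n²·d) envelope; the precise computation never depends on inference being correct.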

The Guardrail Strategy

Both Sikka and industry respondents agree that composite systems can mitigate these limitations. Harmonic, a startup founded by Robinhood CEO Vlad Tenev, demonstrates one approach: encoding LLM outputs in the Lean programming language for formal mathematical verification. This shifts the verification burden from LLM inference to provably correct external systems.
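For flavor, here is what a machine-checked statement looks like in Lean (a trivial example, not Harmonic's actual pipeline): once the kernel accepts the proof term, correctness no longer depends on any model's judgment.

```lean
-- Lean 4: the proof checker, not an LLM, certifies this statement.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```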

Sikka stated: “Our paper is saying that a pure LLM has this inherent limitation, but at the same time it is true that you can build components around LLMs that overcome those limitations.”

Industry Context

The Gap Between Promise and Delivery

The AI industry positioned 2025 as “the year of AI agents,” but deployment reality fell short. Steven Levy’s Wired analysis notes that 2025 became “the year of talking about AI agents” rather than the year of deploying them at scale. Corporate adoption remains constrained by the risk that hallucinations disrupt workflows.

OpenAI’s Position

In September 2025, OpenAI researchers published findings acknowledging that AI hallucinations “continue to plague the field” and that “accuracy will never reach 100 percent.” When asked to provide the lead author’s dissertation title, three tested models (including ChatGPT) fabricated titles and misreported publication years.

Counterarguments

Industry optimists argue:

  • Hallucinations may enable discovery of novel solutions humans never considered
  • Guardrails can filter incorrect outputs effectively enough for practical use
  • Task-specific fine-tuning can reduce errors in narrow domains
  • Economic incentives will drive solutions

Tudor Achim of Harmonic stated: “I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence. The way that systems learn is by hallucinating something. It’s often wrong, but sometimes it’s something that no human has ever thought before.”

Limitations of the Research

The Sikka paper has not undergone peer review (submitted to AAAI-26). The argument assumes tasks are executed in a single inference pass rather than through iterative refinement. Tool-augmented agents that delegate computation externally are explicitly acknowledged as a valid mitigation strategy.

The practical threshold where tasks begin failing depends on specific model architecture, context length, and task encoding. The paper provides theoretical bounds rather than empirical benchmarks for specific task categories.

Key Findings

  • LLM inference complexity is bounded by O(n²·d), creating a ceiling on reliably executable task complexity
  • Tasks requiring cubic, exponential, or factorial complexity cannot be correctly completed by pure LLM systems
  • Multi-agent verification architectures fail because verification often requires equivalent or greater complexity than the original task
  • Reasoning models exhibit “reasoning collapse” at high complexity, contradicting claims that extended thinking overcomes fundamental limits
  • Composite architectures delegating precise computation to external systems remain the viable path for reliable agent deployment
  • Human verification layers are necessary for high-stakes agentic outputs regardless of architectural sophistication

References

  1. Sikka, V., & Sikka, V. (2025). Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models. arXiv:2507.07505. https://arxiv.org/abs/2507.07505
  2. Shojaee, P., et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple ML Research. https://machinelearning.apple.com/research/illusion-of-thinking
  3. Levy, S. (2026, January 23). The Math on AI Agents Doesn’t Add Up. Wired. https://www.wired.com/story/ai-agents-math-doesnt-add-up/
  4. Landymore, F. (2026, January 26). AI Agents Are Mathematically Incapable of Doing Functional Work, Paper Finds. Futurism. https://futurism.com/artificial-intelligence/ai-agents-incapable-math
  5. Hartmanis, J., & Stearns, R. E. (1965). On the computational complexity of algorithms. Transactions of the American Mathematical Society, 117, 285-306.