Fintool AI Agent Architecture: Lessons from Financial Services

Research Date: 2026-01-26
Source URL: https://x.com/nicbstme/status/2015174818497437834
Author: Nicolas Bustamante (@nicbstme), Fintool

Summary

This article documents architectural decisions and operational lessons from two years of building Fintool, an AI agent platform serving professional investors. The author presents eleven core lessons spanning infrastructure (sandboxed execution, S3-first storage, Temporal workflows), data engineering (context normalization, adversarial document parsing), agent design (markdown skills, filesystem tools), and operations (domain-specific evaluation, production monitoring).

The central thesis posits that competitive advantage in AI agent products derives not from model access but from surrounding infrastructure: data quality, domain-encoded skills, user experience, and operational reliability. The article advocates for designing systems with planned obsolescence as model capabilities improve, while building durable moats in data and domain expertise.

The financial services context imposes stringent accuracy requirements—errors in revenue figures or valuation assumptions can result in significant financial losses for users making investment decisions based on agent output.

Domain Context: Financial Services Constraints

Error Intolerance

Financial services represents a high-stakes domain where agent errors carry concrete consequences:

| Error Type | Potential Impact |
| --- | --- |
| Incorrect revenue figure | Misinformed investment thesis |
| Misinterpreted guidance | Incorrect earnings expectations |
| Wrong DCF assumption | Flawed valuation model |
| Fiscal period confusion | Comparing non-comparable quarters |

The author notes that professional investors “spot bullshit instantly” and require “precision, speed, and depth.” This creates what the author describes as “paranoid attention to detail” where every number receives validation and every assumption undergoes stress testing.

User Sophistication

Target users are described as “some of the smartest, most time-pressed people” who cannot accept hand-waving through valuation models or glossed-over nuances. This user profile drives the technical requirements discussed throughout the article.

Infrastructure Architecture

Sandboxed Execution Environments

The article argues that sandboxing is non-optional for multi-step agent workflows. The author recounts an incident where an LLM attempted to execute `rm -rf /` while “trying to clean up temporary files.”

Key Implementation Details:

| Component | Implementation |
| --- | --- |
| Access Control | AWS ABAC with ${aws:PrincipalTag/S3Prefix} restrictions |
| Credential Scope | Short-lived, scoped to specific S3 prefixes |
| Sandbox Lifecycle | 600-second timeout, extended by 10 minutes per tool use |
| Pre-warming | Sandbox initialization begins when the user starts typing |

The three-tier mount system (private/shared/public) enables organizational data sharing while maintaining user isolation. IAM policies physically prevent cross-user data access.
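
A sketch of the ABAC pattern under stated assumptions: the bucket name and statement below are illustrative, not Fintool's actual policy. Each sandbox's credentials carry an S3Prefix principal tag, and the resource ARN confines object access to that user's prefix.

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "IllustrativeUserScopedAccess",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::example-agent-data/${aws:PrincipalTag/S3Prefix}/*"
  }]
}
```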

S3-First Architecture

The article advocates for S3 as the primary data store over traditional databases for user data (watchlists, portfolios, preferences, memories, skills).

Rationale:

| Factor | S3 Advantage |
| --- | --- |
| Durability | 11 nines (99.999999999%) |
| Versioning | Built-in audit trails |
| Simplicity | YAML files are human-readable, debuggable |
| Cost | Lower than equivalent database storage |

Sync Architecture:

  • Real-time: Lambda triggered by S3 events via SNS performs upsert/delete
  • Reconciliation: EventBridge-scheduled Lambda (every 3 hours) performs full S3-to-DB scan
  • Conflict resolution: Timestamp guards ensure newer data wins (see the sketch below)
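
A sketch of the conflict-resolution upsert, assuming a unique index on path and hypothetical content/updated_at columns on the fs_files table that appears later in the article; the WHERE clause is the timestamp guard:

```sql
INSERT INTO fs_files (path, content, updated_at)
VALUES (:path, :content, :updated_at)
ON CONFLICT (path) DO UPDATE
SET content    = EXCLUDED.content,
    updated_at = EXCLUDED.updated_at
WHERE fs_files.updated_at < EXCLUDED.updated_at;  -- newer data wins
```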

User memories are stored as markdown files (/private/memories/UserMemories.md) that users can edit directly. They are injected as context into every conversation:

```xml
<user-memories>
{user_memories}
</user-memories>
```

Temporal for Long-Running Tasks

The article describes Temporal as transformative for handling multi-minute agent workflows:

Problem Context:

  • Company analysis tasks may require 5+ minutes
  • Server restarts, tab closures, and network issues interrupt homegrown job queues
  • State management and retry consistency were problematic

Temporal Benefits:

| Capability | Implementation |
| --- | --- |
| Automatic Retry | Worker crash triggers automatic retry on another worker |
| State Persistence | Workflow state survives infrastructure failures |
| Cancellation | Heartbeat-based cancellation handling |

Worker Configuration:

| Worker Type | Concurrent Activities | Purpose |
| --- | --- | --- |
| Chat | 25 | User-facing requests |
| Background | 10 | Async tasks |

Cancellation requires heartbeats sent every few seconds, with the activity checking cancellation status between heartbeats.
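
A minimal sketch of the heartbeat pattern with Temporal's Python SDK; the activity name and workload are hypothetical:

```python
import asyncio
from temporalio import activity

@activity.defn
async def analyze_company(ticker: str) -> str:
    """Long-running analysis activity; names here are hypothetical."""
    try:
        for step in range(60):
            # Heartbeat every few seconds so Temporal can deliver
            # cancellation; without heartbeats, cancellation never arrives.
            activity.heartbeat(f"step {step}")
            await asyncio.sleep(5)  # stand-in for real analysis work
        return f"analysis complete for {ticker}"
    except asyncio.CancelledError:
        # User cancelled: clean up, then re-raise so Temporal records it.
        raise

# Worker configuration mirroring the table above (temporalio.worker.Worker):
#   Worker(client, task_queue="chat", activities=[analyze_company],
#          max_concurrent_activities=25)   # background queue uses 10
```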

Data Engineering: Context as Product

The Normalization Challenge

The article asserts that “your agent is only as good as the context it can access” and that the “real work isn’t prompt engineering—it’s turning messy financial data from dozens of sources into clean, structured context.”

Source Heterogeneity:

| Source Type | Format Characteristics |
| --- | --- |
| SEC Filings | HTML with nested tables, exhibits, signatures |
| Earnings Transcripts | Speaker-segmented text with Q&A sections |
| Press Releases | Semi-structured HTML from PRNewswire |
| Research Reports | PDFs with charts and footnotes |
| Market Data | Structured numerical data from Snowflake/databases |
| News | Articles with varying quality and structure |
| Alternative Data | Satellite imagery, web traffic, credit card panels |
| Broker Research | Proprietary PDFs with price targets and models |
| Fund Filings | 13F holdings, proxy statements, activist letters |

Normalization Output:

All sources are converted to three formats:

  1. Markdown: Narrative content (filings, transcripts, articles)
  2. CSV/Tables: Structured data (financials, metrics, comparisons)
  3. JSON Metadata: Searchability attributes (tickers, dates, document types, fiscal periods)

Chunking Strategy

Different document types require different chunking approaches:

| Document Type | Chunking Strategy |
| --- | --- |
| 10-K Filings | By regulatory section (Item 1, 1A, 7, 8…) |
| Earnings Transcripts | By speaker turn (CEO, CFO, Q&A by analyst) |
| Press Releases | Single chunk (typically small enough) |
| News Articles | Paragraph-level chunks |
| 13F Filings | By holder and position changes quarter-over-quarter |

The article notes that “chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.”
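
As one concrete illustration, a minimal sketch of the 10-K strategy from the table above, assuming the filing has already been normalized to markdown; the regex is a simplification of what real filings require:

```python
import re

# Split a normalized 10-K into chunks at regulatory section headings
# ("Item 1.", "Item 1A.", "Item 7.", ...). Real filings need fuzzier
# matching (case variants, punctuation, table-of-contents entries).
ITEM_HEADING = re.compile(r"^(Item\s+\d+[A-Z]?\.)", re.MULTILINE | re.IGNORECASE)

def chunk_10k(markdown_text: str) -> list[str]:
    parts = ITEM_HEADING.split(markdown_text)
    # split() with a capturing group yields [preamble, heading, body, ...]
    chunks = [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    return [c.strip() for c in chunks if c.strip()]

filing = "Item 1. Business\nWe make widgets.\nItem 1A. Risk Factors\nWidgets are risky."
print(len(chunk_10k(filing)))  # 2
```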

Table Handling

LLMs demonstrate strong reasoning capability over markdown tables but perform poorly on raw HTML `<table>` markup or CSV dumps. The normalization layer converts all tabular data to clean markdown format.
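
One possible conversion path, sketched with pandas; the article does not specify the actual implementation (pandas.read_html needs an HTML parser such as lxml, and to_markdown needs the tabulate package):

```python
from io import StringIO
import pandas as pd

# Convert an HTML financial table into the markdown form models reason
# over best. Real filings need the custom handling described below
# (merged headers, footnote markers, negative notation).
html = """<table>
  <tr><th>Metric</th><th>FY2024</th></tr>
  <tr><td>Revenue</td><td>$4,200M</td></tr>
</table>"""

df = pd.read_html(StringIO(html))[0]   # returns a list of DataFrames
print(df.to_markdown(index=False))     # clean pipe-table markdown
```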

Metadata-Enabled Retrieval

Every document includes a meta.json file with structured metadata that enables filtered retrieval.
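
The article does not reproduce the schema; a hypothetical example assembled from the searchability attributes listed earlier (tickers, dates, document types, fiscal periods), with all field names assumed:

```json
{
  "ticker": "AAPL",
  "document_type": "10-K",
  "filing_date": "2023-11-03",
  "fiscal_year": 2023,
  "fiscal_period": "FY",
  "source": "SEC EDGAR"
}
```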

The Parsing Problem

SEC Filing Adversarial Characteristics

SEC filings are characterized as “not designed for machine reading” but “designed for legal compliance”:

| Challenge | Description |
| --- | --- |
| Multi-page tables | Tables span pages with repeated headers |
| Circular references | Footnotes reference exhibits referencing footnotes |
| Inconsistent numbers | Same figures appear differently in text vs. tables |
| Unreliable XBRL | Tags often wrong or incomplete |
| Format variance | Each law firm uses different templates |

Off-the-shelf parser failures:

  • Multi-column layouts in proxy statements
  • Nested tables in MD&A sections (tables within tables)
  • Watermarks and headers bleeding into content
  • Scanned exhibits (still common in older filings)
  • Unicode issues (curly quotes, em-dashes, non-breaking spaces)

Parsing Pipeline

Table Extraction Complexity

Financial tables contain dense semantic information:

| Table Element | Challenge |
| --- | --- |
| Merged headers | Cells spanning multiple columns |
| Footnote markers | (1), (2), (a), (b) referencing explanations below |
| Negative notation | $(1,234) means -1,234 |
| Mixed units | Millions for revenue, percentages for margins |
| Restatements | Prior period adjustments in italics or with asterisks |

Quality Scoring Dimensions:

  • Cell boundary accuracy (split/merge correctness)
  • Header detection (row 1 vs. title row above)
  • Numeric parsing (text vs. parsed number)
  • Unit inference (millions, billions, per share, percentage)

Tables scoring below 90% confidence are flagged for review and excluded from agent context.
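
A minimal sketch of the quality gate, assuming equal weights across the four dimensions (the article does not specify the weighting):

```python
# Each argument is a 0-1 score for one of the dimensions listed above.
def table_quality(cell_accuracy: float, header_detection: float,
                  numeric_parsing: float, unit_inference: float) -> float:
    return (cell_accuracy + header_detection +
            numeric_parsing + unit_inference) / 4

def include_in_context(**scores: float) -> bool:
    # Below 90% confidence: flag for review, exclude from agent context.
    return table_quality(**scores) >= 0.90

print(include_in_context(cell_accuracy=0.98, header_detection=0.95,
                         numeric_parsing=0.99, unit_inference=0.60))  # False
```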

Fiscal Period Normalization

“Q1 2024” is ambiguous without company context:

| Company | Q1 2024 Means | Fiscal Year End |
| --- | --- | --- |
| Calendar-year company | January-March 2024 | December |
| Apple | October-December 2023 | September |
| Microsoft | July-September 2023 | June |

The system maintains fiscal calendars for 10,000+ companies, normalizing all date references to absolute date ranges. Without this normalization, comparisons would conflate non-comparable periods.
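
A minimal sketch of the normalization arithmetic, assuming the fiscal calendar stores each company's fiscal-year-end month:

```python
import calendar
from datetime import date

def fiscal_quarter_range(fiscal_year: int, quarter: int,
                         fye_month: int) -> tuple[date, date]:
    """Resolve 'Q{quarter} FY{fiscal_year}' to an absolute date range,
    given the month in which the company's fiscal year ends."""
    # Fiscal year FY starts the month after the previous FY-end month.
    start_year = fiscal_year if fye_month == 12 else fiscal_year - 1
    offset = (fye_month % 12) + 3 * (quarter - 1)  # months since Jan of start_year
    start = date(start_year + offset // 12, offset % 12 + 1, 1)
    end_offset = offset + 2
    end_year, end_month = start_year + end_offset // 12, end_offset % 12 + 1
    return start, date(end_year, end_month,
                       calendar.monthrange(end_year, end_month)[1])

print(fiscal_quarter_range(2024, 1, 12))  # calendar-year: 2024-01-01..2024-03-31
print(fiscal_quarter_range(2024, 1, 9))   # Apple:         2023-10-01..2023-12-31
print(fiscal_quarter_range(2024, 1, 6))   # Microsoft:     2023-07-01..2023-09-30
```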

Agent Design: Skills Architecture

Skills as First-Class Citizens

The article’s central agent design thesis: “the model is not the product. The skills are now the product.”

Without skills, frontier models “know what DCF is” and “can explain the theory” but produce subtly incorrect output when executing: missing critical steps, using wrong discount rates, forgetting stock-based compensation add-backs, skipping sensitivity analysis.

Skill Structure (Anthropic Agent Skills Specification, October 2025):

A skill is a folder containing SKILL.md with YAML frontmatter plus supporting files. For example (the frontmatter values below are illustrative; the body follows the article):

```markdown
---
name: dcf
description: Discounted cash flow valuation  # illustrative frontmatter
---

# dcf

## When to Use
Use this skill for discounted cash flow valuations.

## Instructions
1. Deep dive on the company using Task tool
2. Identify industry and load industry-specific guidelines
3. Gather financial data: revenue, margins, CapEx, working capital
4. Build DCF model in Excel using xlsx skill
5. Calculate WACC using industry benchmarks
6. Run sensitivity analysis on WACC and terminal growth
7. Validate: reconcile to actuals, compare to market price
8. Document view vs market pricing

## Industry Guidelines
- Technology/SaaS: `/public/skills/dcf/guidelines/technology-saas.md`
- Healthcare/Pharma: `/public/skills/dcf/guidelines/healthcare-pharma-biotech.md`
- Financial Services: `/public/skills/dcf/guidelines/financial-services.md`
[... 10+ industries]
```

Skills vs. Code

| Advantage | Explanation |
| --- | --- |
| Non-engineer authorship | Analysts and customers can create skills |
| No deployment | File changes take effect immediately |
| Auditability | Human-readable intent vs. opaque code |

Shadowing System

Priority resolution: private > shared > public

Users can override default skills by placing custom versions in /private/skills/. The user’s version wins.
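
A minimal sketch of the resolution rule over discovered skill rows; the row shape is an assumption:

```python
# Lower number wins: private shadows shared, shared shadows public.
PRIORITY = {"private": 0, "shared": 1, "public": 2}

def resolve_skills(rows: list[dict]) -> dict[str, dict]:
    """Keep the highest-priority version of each skill name."""
    resolved: dict[str, dict] = {}
    for row in sorted(rows, key=lambda r: PRIORITY[r["tier"]]):
        resolved.setdefault(row["name"], row)  # first (highest-priority) wins
    return resolved

rows = [
    {"name": "dcf", "tier": "public",  "path": "skills/dcf/SKILL.md"},
    {"name": "dcf", "tier": "private", "path": "skills/dcf/SKILL.md"},
]
assert resolve_skills(rows)["dcf"]["tier"] == "private"  # user's version wins
```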

SQL-Based Skill Discovery

Skills are not mounted directly to the filesystem. Instead, metadata is queried from PostgreSQL:

```sql
SELECT user_id, path, metadata
FROM fs_files
WHERE user_id = ANY(:user_ids)
  AND path LIKE 'skills/%/SKILL.md';
```

Rationale:

| Concern | SQL Solution |
| --- | --- |
| Token efficiency | Lazy loading: full docs loaded only when a skill is invoked |
| Access control | Three-tier access enforced at query time |
| Shadowing logic | Priority resolution via SQL rather than filesystem symlinks |
| Metadata filtering | Query YAML frontmatter without reading files |

The author emphasizes: “Top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills.”

Planned Obsolescence

The article advises designing skills “knowing that future models will need less hand-holding.” Detailed step-by-step instructions needed today may become one-liners as model capabilities improve.

Strategy:

  • Write skills for current limitations
  • Delete skills when they become unnecessary
  • Build new skills for emerging harder problems
  • Prefer markdown over code for easier modification and deletion

The author predicts: “in two years, most of our basic skills will be one-liners.”

Filesystem Tools

Core Tool Set

| Tool | Purpose |
| --- | --- |
| ReadFile | Handle complexity of various file formats |
| WriteFile | Create artifacts linked back to UI |
| Bash | Persistent shell access (180s timeout, 100K char limit) |

Files written to /private/artifacts/ become clickable links in the UI via the computer://user_id/artifacts/ protocol.

Bash as Exploration Tool

The article references a Braintrust evaluation comparing SQL agents, bash agents, and hybrid approaches:

| Approach | Accuracy | Trade-off |
| --- | --- | --- |
| Pure SQL | 100% | Missed edge cases |
| Pure Bash | Lower | Slower and more expensive, but caught errors through verification |
| Hybrid | Best | Bash for exploration, SQL for structured queries |

The author advocates “full shell access in the sandbox” for exploration, verification, and ad-hoc data manipulation that complex financial tasks require.

Real-Time Streaming

Delta Updates Architecture

Delta Operations:

| Operation | Purpose |
| --- | --- |
| ADD | Insert object at index |
| APPEND | Append to string/array |
| REPLACE | Replace content |
| PATCH | Partial update |
| TRUNCATE | Remove content |

Delta updates send “append these 50 characters” rather than “here’s the complete response so far” for efficiency.
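
A minimal sketch of client-side application of these operations; the operation names come from the table above, while the payload shapes are assumptions:

```python
def apply_delta(state: dict, op: str, path: str, value=None, index=None):
    if op == "ADD":
        state[path].insert(index, value)   # insert object at index
    elif op == "APPEND":
        state[path] += value               # works for strings and lists
    elif op == "REPLACE":
        state[path] = value
    elif op == "PATCH":
        state[path].update(value)          # partial update of a dict
    elif op == "TRUNCATE":
        del state[path]
    return state

state = {"answer": "Revenue grew"}
apply_delta(state, "APPEND", "answer", value=" 12% year-over-year")
print(state["answer"])  # Revenue grew 12% year-over-year
```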

Rich Content Streaming

Streamdown renders markdown progressively with custom plugins for:

  • Charts (progressive rendering)
  • Citations (linked to source documents)
  • Math equations (KaTeX)

Interactive Agent Workflows

The AskUserQuestion tool enables mid-workflow user input.
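
The article does not show the tool's schema; a hypothetical call shape for a valuation checkpoint:

```json
{
  "tool": "AskUserQuestion",
  "question": "Which discount rate should the DCF use?",
  "options": [
    {"label": "8.5%", "description": "Industry-benchmark WACC"},
    {"label": "10.0%", "description": "Conservative, adds a risk premium"}
  ]
}
```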

This transforms agents “from autonomous black boxes into collaborative tools” where users validate key assumptions in high-stakes financial work.

Evaluation System

Domain-Specific Test Categories

The article describes ~2,000 test cases across categories. Generic NLP metrics (BLEU, ROUGE) fail for finance because “a response can be semantically similar but have completely wrong numbers.”

Ticker Disambiguation:

| Input | Correct | Incorrect Alternative |
| --- | --- | --- |
| “Apple” | AAPL | APLE (Appel Petroleum) |
| “Meta” | META | MSTR |
| “Delta” | DAL | Delta hedging (options) |

Historical Ticker Changes:

  • Facebook → META (2021)
  • Google → GOOG/GOOGL restructure
  • Twitter → X

Queries about “Facebook stock in 2023” require understanding the FB → META mapping.

Fiscal Period Testing:

“Last quarter” on January 15th means different things:

| Company Type | “Last Quarter” Means |
| --- | --- |
| Calendar-year | Q4 2024 |
| Apple | Q1 2025 (just reported) |
| Microsoft | Q2 2025 (mid-quarter) |

200+ test cases cover period extraction.

Numeric Precision:

All equivalent: $4.2B, $4,200M, $4.2 billion, “four point two billion”

Fails: “4.2” without units (millions? billions? per share?)
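
A minimal sketch of the normalization an eval grader needs; the regex and unit table are simplifications (spelled-out numbers are not handled):

```python
import re

MULTIPLIERS = {"b": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6}

def parse_amount(text: str) -> int | None:
    """Normalize a dollar amount to raw dollars; None if ambiguous."""
    m = re.search(r"\$?\s*([\d.]+)\s*(billion|million|[bm])?\b",
                  text.lower().replace(",", ""))
    if not m or m.group(2) is None:
        return None  # no number, or no units: ambiguous, fail the case
    return round(float(m.group(1)) * MULTIPLIERS[m.group(2)])

assert parse_amount("$4.2B") == parse_amount("$4,200M") == parse_amount("4.2 billion")
assert parse_amount("4.2") is None  # units missing
```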

Adversarial Grounding:

Fake numbers are injected into context alongside real sources. If the agent cites the planted fake ($50B) instead of the real 10-K figure ($94B), the test fails. 50 test cases specifically target hallucination resistance.
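
A minimal sketch of one such case; the harness shape is an assumption, while the planted and real figures come from the article's example:

```python
PLANTED_FAKE = "$50B"   # injected into context alongside real sources
REAL_FIGURE = "$94B"    # the actual figure from the 10-K

def grounding_test(agent_response: str) -> bool:
    """Pass only if the agent cites the real figure, not the plant."""
    return REAL_FIGURE in agent_response and PLANTED_FAKE not in agent_response

assert grounding_test("Revenue was $94B per the 10-K.")
assert not grounding_test("Revenue was $50B.")
```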

Eval-Driven Development

  • Every skill has a companion eval
  • PRs are blocked if an eval score drops by more than 5%
  • The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity, and SBC add-backs

Production Monitoring

Observability Stack

Model Routing

| Query Complexity | Model | Rationale |
| --- | --- | --- |
| Simple | Haiku | Cost-effective |
| Complex | Sonnet | Higher quality |
| Enterprise | Best | Always premium |

Strategic Thesis

The Model Is Not the Product

The article concludes with a strategic framework:

“Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else.”

Durable Moats:

| Asset | Moat Characteristic |
| --- | --- |
| Financial data | Normalized decades of filings |
| Domain skills | Encoded expertise from analysts and customers |
| Real-time UX | Streaming, interactive workflows |
| User trust | Track record with professional investors |
| Domain knowledge | Time spent with customers understanding needs |

RAG to Agentic Search Transition

The author references a prior “RAG obituary” article and describes retiring embedding pipelines in favor of fully agentic search. This architectural shift was informed by discussions with Anthropic’s Claude Code team about a “filesystem-first agentic approach.”

The claim that “most startups are adopting these best practices” after initial skepticism suggests this represents an emerging consensus in agent architecture.

Key Findings

  • Sandboxed execution with user-isolated environments is mandatory for multi-step agent workflows executing arbitrary code
  • S3-first architecture with PostgreSQL sync provides superior durability, versioning, and cost characteristics for user data
  • Temporal workflows solve long-running task reliability with automatic retry and proper cancellation handling
  • Context normalization (converting heterogeneous financial data to clean markdown/CSV/JSON) represents the majority of engineering work
  • SEC filing parsing requires custom pipelines handling adversarial document characteristics; off-the-shelf parsers fail on edge cases
  • Markdown skills encode domain expertise and represent the durable product; models are commoditizing
  • Domain-specific evaluation (~2,000 test cases) catches errors that generic NLP metrics miss
  • The competitive moat lies in data, skills, UX, and domain expertise—not model access

References

  1. Nicolas Bustamante Twitter Thread - 2026-01-26
  2. Anthropic Agent Skills Specification - October 2025
  3. Temporal Workflow Engine
  4. Braintrust Evaluation Platform
  5. AWS ABAC Documentation