Fintool AI Agent Architecture: Lessons from Financial Services

Research Date: 2026-01-26
Source URL: https://x.com/nicbstme/status/2015174818497437834
Author: Nicolas Bustamante (@nicbstme), Fintool

Summary

This article documents architectural decisions and operational lessons from two years of building Fintool, an AI agent platform serving professional investors. The author presents eleven core lessons spanning infrastructure (sandboxed execution, S3-first storage, Temporal workflows), data engineering (context normalization, adversarial document parsing), agent design (markdown skills, filesystem tools), and operations (domain-specific evaluation, production monitoring).

The central thesis posits that competitive advantage in AI agent products derives not from model access but from surrounding infrastructure: data quality, domain-encoded skills, user experience, and operational reliability. The article advocates for designing systems with planned obsolescence as model capabilities improve, while building durable moats in data and domain expertise.

The financial services context imposes stringent accuracy requirements—errors in revenue figures or valuation assumptions can result in significant financial losses for users making investment decisions based on agent output.

Domain Context: Financial Services Constraints

Error Intolerance

Financial services represents a high-stakes domain where agent errors carry concrete consequences:

| Error Type | Potential Impact |
| --- | --- |
| Incorrect revenue figure | Misinformed investment thesis |
| Misinterpreted guidance | Incorrect earnings expectations |
| Wrong DCF assumption | Flawed valuation model |
| Fiscal period confusion | Comparing non-comparable quarters |

The author notes that professional investors “spot bullshit instantly” and require “precision, speed, and depth.” This creates what the author describes as “paranoid attention to detail” where every number receives validation and every assumption undergoes stress testing.

User Sophistication

Target users are described as “some of the smartest, most time-pressed people” who cannot accept hand-waving through valuation models or glossed-over nuances. This user profile drives the technical requirements discussed throughout the article.

Infrastructure Architecture

Sandboxed Execution Environments

The article argues that sandboxing is non-optional for multi-step agent workflows. The author recounts an incident where an LLM attempted to execute `rm -rf /` while “trying to clean up temporary files.”

Key Implementation Details:

| Component | Implementation |
| --- | --- |
| Access Control | AWS ABAC with ${aws:PrincipalTag/S3Prefix} restrictions |
| Credential Scope | Short-lived, scoped to specific S3 prefixes |
| Sandbox Lifecycle | 600-second timeout, extended by 10 minutes per tool use |
| Pre-warming | Sandbox initialization begins when the user starts typing |

The three-tier mount system (private/shared/public) enables organizational data sharing while maintaining user isolation. IAM policies physically prevent cross-user data access.
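
A sketch of the ABAC pattern under stated assumptions: the bucket name and statement below are illustrative, not Fintool's actual policy. Each sandbox's credentials carry an S3Prefix principal tag, and the resource ARN confines object access to that user's prefix.

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "IllustrativeUserScopedAccess",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::example-agent-data/${aws:PrincipalTag/S3Prefix}/*"
  }]
}
```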

S3-First Architecture

The article advocates for S3 as the primary data store over traditional databases for user data (watchlists, portfolios, preferences, memories, skills).

Rationale:

| Factor | S3 Advantage |
| --- | --- |
| Durability | 11 nines (99.999999999%) |
| Versioning | Built-in audit trails |
| Simplicity | YAML files are human-readable, debuggable |
| Cost | Lower than equivalent database storage |

Sync Architecture:

  • Real-time: Lambda triggered by S3 events via SNS performs upsert/delete
  • Reconciliation: EventBridge-scheduled Lambda (every 3 hours) performs full S3-to-DB scan
  • Conflict resolution: Timestamp guards ensure newer data wins (see the sketch below)
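
A sketch of the conflict-resolution upsert, assuming a unique index on path and hypothetical content/updated_at columns on the fs_files table that appears later in the article; the WHERE clause is the timestamp guard:

```sql
INSERT INTO fs_files (path, content, updated_at)
VALUES (:path, :content, :updated_at)
ON CONFLICT (path) DO UPDATE
SET content    = EXCLUDED.content,
    updated_at = EXCLUDED.updated_at
WHERE fs_files.updated_at < EXCLUDED.updated_at;  -- newer data wins
```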

User memories are stored as markdown files (/private/memories/UserMemories.md) that users can edit directly. They are injected as context into every conversation:

```xml
<user-memories>
{user_memories}
</user-memories>
```

Temporal for Long-Running Tasks

The article describes Temporal as transformative for handling multi-minute agent workflows:

Problem Context:

  • Company analysis tasks may require 5+ minutes
  • Server restarts, tab closures, and network issues interrupt homegrown job queues
  • State management and retry consistency were problematic

Temporal Benefits:

| Capability | Implementation |
| --- | --- |
| Automatic Retry | Worker crash triggers automatic retry on another worker |
| State Persistence | Workflow state survives infrastructure failures |
| Cancellation | Heartbeat-based cancellation handling |

Worker Configuration:

| Worker Type | Concurrent Activities | Purpose |
| --- | --- | --- |
| Chat | 25 | User-facing requests |
| Background | 10 | Async tasks |

Cancellation requires heartbeats sent every few seconds, with the activity checking cancellation status between heartbeats.
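
A minimal sketch of the heartbeat pattern with Temporal's Python SDK; the activity name and workload are hypothetical:

```python
import asyncio
from temporalio import activity

@activity.defn
async def analyze_company(ticker: str) -> str:
    """Long-running analysis activity; names here are hypothetical."""
    try:
        for step in range(60):
            # Heartbeat every few seconds so Temporal can deliver
            # cancellation; without heartbeats, cancellation never arrives.
            activity.heartbeat(f"step {step}")
            await asyncio.sleep(5)  # stand-in for real analysis work
        return f"analysis complete for {ticker}"
    except asyncio.CancelledError:
        # User cancelled: clean up, then re-raise so Temporal records it.
        raise

# Worker configuration mirroring the table above (temporalio.worker.Worker):
#   Worker(client, task_queue="chat", activities=[analyze_company],
#          max_concurrent_activities=25)   # background queue uses 10
```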

Data Engineering: Context as Product

The Normalization Challenge

The article asserts that “your agent is only as good as the context it can access” and that the “real work isn’t prompt engineering—it’s turning messy financial data from dozens of sources into clean, structured context.”

Source Heterogeneity:

| Source Type | Format Characteristics |
| --- | --- |
| SEC Filings | HTML with nested tables, exhibits, signatures |
| Earnings Transcripts | Speaker-segmented text with Q&A sections |
| Press Releases | Semi-structured HTML from PRNewswire |
| Research Reports | PDFs with charts and footnotes |
| Market Data | Structured numerical data from Snowflake/databases |
| News | Articles with varying quality and structure |
| Alternative Data | Satellite imagery, web traffic, credit card panels |
| Broker Research | Proprietary PDFs with price targets and models |
| Fund Filings | 13F holdings, proxy statements, activist letters |

Normalization Output:

All sources are converted to three formats:

  1. Markdown: Narrative content (filings, transcripts, articles)
  2. CSV/Tables: Structured data (financials, metrics, comparisons)
  3. JSON Metadata: Searchability attributes (tickers, dates, document types, fiscal periods)

Chunking Strategy

Different document types require different chunking approaches:

| Document Type | Chunking Strategy |
| --- | --- |
| 10-K Filings | By regulatory section (Item 1, 1A, 7, 8…) |
| Earnings Transcripts | By speaker turn (CEO, CFO, Q&A by analyst) |
| Press Releases | Single chunk (typically small enough) |
| News Articles | Paragraph-level chunks |
| 13F Filings | By holder and position changes quarter-over-quarter |

The article notes that “chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.”
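
As one concrete illustration, a minimal sketch of the 10-K strategy from the table above, assuming the filing has already been normalized to markdown; the regex is a simplification of what real filings require:

```python
import re

# Split a normalized 10-K into chunks at regulatory section headings
# ("Item 1.", "Item 1A.", "Item 7.", ...). Real filings need fuzzier
# matching (case variants, punctuation, table-of-contents entries).
ITEM_HEADING = re.compile(r"^(Item\s+\d+[A-Z]?\.)", re.MULTILINE | re.IGNORECASE)

def chunk_10k(markdown_text: str) -> list[str]:
    parts = ITEM_HEADING.split(markdown_text)
    # split() with a capturing group yields [preamble, heading, body, ...]
    chunks = [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    return [c.strip() for c in chunks if c.strip()]

filing = "Item 1. Business\nWe make widgets.\nItem 1A. Risk Factors\nWidgets are risky."
print(len(chunk_10k(filing)))  # 2
```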

Table Handling

LLMs demonstrate strong reasoning capability over markdown tables but perform poorly on raw HTML `<table>` markup or CSV dumps. The normalization layer converts all tabular data to clean markdown format.
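
One possible conversion path, sketched with pandas; the article does not specify the actual implementation (pandas.read_html needs an HTML parser such as lxml, and to_markdown needs the tabulate package):

```python
from io import StringIO
import pandas as pd

# Convert an HTML financial table into the markdown form models reason
# over best. Real filings need the custom handling described below
# (merged headers, footnote markers, negative notation).
html = """<table>
  <tr><th>Metric</th><th>FY2024</th></tr>
  <tr><td>Revenue</td><td>$4,200M</td></tr>
</table>"""

df = pd.read_html(StringIO(html))[0]   # returns a list of DataFrames
print(df.to_markdown(index=False))     # clean pipe-table markdown
```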

Metadata-Enabled Retrieval

Every document includes a meta.json file with structured metadata that enables filtered retrieval.
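
The article does not reproduce the schema; a hypothetical example assembled from the searchability attributes listed earlier (tickers, dates, document types, fiscal periods), with all field names assumed:

```json
{
  "ticker": "AAPL",
  "document_type": "10-K",
  "filing_date": "2023-11-03",
  "fiscal_year": 2023,
  "fiscal_period": "FY",
  "source": "SEC EDGAR"
}
```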

The Parsing Problem

SEC Filing Adversarial Characteristics

SEC filings are characterized as “not designed for machine reading” but “designed for legal compliance”:

| Challenge | Description |
| --- | --- |
| Multi-page tables | Tables span pages with repeated headers |
| Circular references | Footnotes reference exhibits referencing footnotes |
| Inconsistent numbers | Same figures appear differently in text vs. tables |
| Unreliable XBRL | Tags often wrong or incomplete |
| Format variance | Each law firm uses different templates |

Off-the-shelf parser failures:

  • Multi-column layouts in proxy statements
  • Nested tables in MD&A sections (tables within tables)
  • Watermarks and headers bleeding into content
  • Scanned exhibits (still common in older filings)
  • Unicode issues (curly quotes, em-dashes, non-breaking spaces)

Parsing Pipeline

Table Extraction Complexity

Financial tables contain dense semantic information:

| Table Element | Challenge |
| --- | --- |
| Merged headers | Cells spanning multiple columns |
| Footnote markers | (1), (2), (a), (b) referencing explanations below |
| Negative notation | $(1,234) means -1,234 |
| Mixed units | Millions for revenue, percentages for margins |
| Restatements | Prior period adjustments in italics or with asterisks |

Quality Scoring Dimensions:

  • Cell boundary accuracy (split/merge correctness)
  • Header detection (row 1 vs. title row above)
  • Numeric parsing (text vs. parsed number)
  • Unit inference (millions, billions, per share, percentage)

Tables scoring below 90% confidence are flagged for review and excluded from agent context.
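
A minimal sketch of the quality gate, assuming equal weights across the four dimensions (the article does not specify the weighting):

```python
# Each argument is a 0-1 score for one of the dimensions listed above.
def table_quality(cell_accuracy: float, header_detection: float,
                  numeric_parsing: float, unit_inference: float) -> float:
    return (cell_accuracy + header_detection +
            numeric_parsing + unit_inference) / 4

def include_in_context(**scores: float) -> bool:
    # Below 90% confidence: flag for review, exclude from agent context.
    return table_quality(**scores) >= 0.90

print(include_in_context(cell_accuracy=0.98, header_detection=0.95,
                         numeric_parsing=0.99, unit_inference=0.60))  # False
```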

Fiscal Period Normalization

“Q1 2024” is ambiguous without company context:

| Company | Q1 2024 Means | Fiscal Year End |
| --- | --- | --- |
| Calendar-year company | January-March 2024 | December |
| Apple | October-December 2023 | September |
| Microsoft | July-September 2023 | June |

The system maintains fiscal calendars for 10,000+ companies, normalizing all date references to absolute date ranges. Without this normalization, comparisons would conflate non-comparable periods.
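
A minimal sketch of the normalization arithmetic, assuming the fiscal calendar stores each company's fiscal-year-end month:

```python
import calendar
from datetime import date

def fiscal_quarter_range(fiscal_year: int, quarter: int,
                         fye_month: int) -> tuple[date, date]:
    """Resolve 'Q{quarter} FY{fiscal_year}' to an absolute date range,
    given the month in which the company's fiscal year ends."""
    # Fiscal year FY starts the month after the previous FY-end month.
    start_year = fiscal_year if fye_month == 12 else fiscal_year - 1
    offset = (fye_month % 12) + 3 * (quarter - 1)  # months since Jan of start_year
    start = date(start_year + offset // 12, offset % 12 + 1, 1)
    end_offset = offset + 2
    end_year, end_month = start_year + end_offset // 12, end_offset % 12 + 1
    return start, date(end_year, end_month,
                       calendar.monthrange(end_year, end_month)[1])

print(fiscal_quarter_range(2024, 1, 12))  # calendar-year: 2024-01-01..2024-03-31
print(fiscal_quarter_range(2024, 1, 9))   # Apple:         2023-10-01..2023-12-31
print(fiscal_quarter_range(2024, 1, 6))   # Microsoft:     2023-07-01..2023-09-30
```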

Agent Design: Skills Architecture

Skills as First-Class Citizens

The article’s central agent design thesis: “the model is not the product. The skills are now the product.”

Without skills, frontier models “know what DCF is” and “can explain the theory” but produce subtly incorrect output when executing: missing critical steps, using wrong discount rates, forgetting stock-based compensation add-backs, skipping sensitivity analysis.

Skill Structure (Anthropic Agent Skills Specification, October 2025):

A skill is a folder containing SKILL.md with YAML frontmatter plus supporting files. For example (the frontmatter values below are illustrative; the body follows the article):

```markdown
---
name: dcf
description: Discounted cash flow valuation  # illustrative frontmatter
---

# dcf

## When to Use
Use this skill for discounted cash flow valuations.

## Instructions
1. Deep dive on the company using Task tool
2. Identify industry and load industry-specific guidelines
3. Gather financial data: revenue, margins, CapEx, working capital
4. Build DCF model in Excel using xlsx skill
5. Calculate WACC using industry benchmarks
6. Run sensitivity analysis on WACC and terminal growth
7. Validate: reconcile to actuals, compare to market price
8. Document view vs market pricing

## Industry Guidelines
- Technology/SaaS: `/public/skills/dcf/guidelines/technology-saas.md`
- Healthcare/Pharma: `/public/skills/dcf/guidelines/healthcare-pharma-biotech.md`
- Financial Services: `/public/skills/dcf/guidelines/financial-services.md`
[... 10+ industries]
```

Skills vs. Code

| Advantage | Explanation |
| --- | --- |
| Non-engineer authorship | Analysts and customers can create skills |
| No deployment | File changes take effect immediately |
| Auditability | Human-readable intent vs. opaque code |

Shadowing System

Priority resolution: private > shared > public

Users can override default skills by placing custom versions in /private/skills/. The user’s version wins.
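
A minimal sketch of the resolution rule over discovered skill rows; the row shape is an assumption:

```python
# Lower number wins: private shadows shared, shared shadows public.
PRIORITY = {"private": 0, "shared": 1, "public": 2}

def resolve_skills(rows: list[dict]) -> dict[str, dict]:
    """Keep the highest-priority version of each skill name."""
    resolved: dict[str, dict] = {}
    for row in sorted(rows, key=lambda r: PRIORITY[r["tier"]]):
        resolved.setdefault(row["name"], row)  # first (highest-priority) wins
    return resolved

rows = [
    {"name": "dcf", "tier": "public",  "path": "skills/dcf/SKILL.md"},
    {"name": "dcf", "tier": "private", "path": "skills/dcf/SKILL.md"},
]
assert resolve_skills(rows)["dcf"]["tier"] == "private"  # user's version wins
```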

SQL-Based Skill Discovery

Skills are not mounted directly to the filesystem. Instead, metadata is queried from PostgreSQL:

```sql
SELECT user_id, path, metadata
FROM fs_files
WHERE user_id = ANY(:user_ids)
  AND path LIKE 'skills/%/SKILL.md';
```

Rationale:

| Concern | SQL Solution |
| --- | --- |
| Token efficiency | Lazy loading: full docs loaded only when a skill is invoked |
| Access control | Three-tier access enforced at query time |
| Shadowing logic | Priority resolution via SQL rather than filesystem symlinks |
| Metadata filtering | Query YAML frontmatter without reading files |

The author emphasizes: “Top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills.”

Planned Obsolescence

The article advises designing skills “knowing that future models will need less hand-holding.” Detailed step-by-step instructions needed today may become one-liners as model capabilities improve.

Strategy:

  • Write skills for current limitations
  • Delete skills when they become unnecessary
  • Build new skills for emerging harder problems
  • Prefer markdown over code for easier modification and deletion

The author predicts: “in two years, most of our basic skills will be one-liners.”

Filesystem Tools

Core Tool Set

| Tool | Purpose |
| --- | --- |
| ReadFile | Handle complexity of various file formats |
| WriteFile | Create artifacts linked back to UI |
| Bash | Persistent shell access (180s timeout, 100K char limit) |

Files written to /private/artifacts/ become clickable links in the UI via the computer://user_id/artifacts/ protocol.

Bash as Exploration Tool

The article references a Braintrust evaluation comparing SQL agents, bash agents, and hybrid approaches:

| Approach | Accuracy | Trade-off |
| --- | --- | --- |
| Pure SQL | 100% | Missed edge cases |
| Pure Bash | Lower | Slower and more expensive, but caught errors through verification |
| Hybrid | Best | Bash for exploration, SQL for structured queries |

The author advocates “full shell access in the sandbox” for exploration, verification, and ad-hoc data manipulation that complex financial tasks require.

Real-Time Streaming

Delta Updates Architecture

Delta Operations:

| Operation | Purpose |
| --- | --- |
| ADD | Insert object at index |
| APPEND | Append to string/array |
| REPLACE | Replace content |
| PATCH | Partial update |
| TRUNCATE | Remove content |

Delta updates send “append these 50 characters” rather than “here’s the complete response so far” for efficiency.
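
A minimal sketch of client-side application of these operations; the operation names come from the table above, while the payload shapes are assumptions:

```python
def apply_delta(state: dict, op: str, path: str, value=None, index=None):
    if op == "ADD":
        state[path].insert(index, value)   # insert object at index
    elif op == "APPEND":
        state[path] += value               # works for strings and lists
    elif op == "REPLACE":
        state[path] = value
    elif op == "PATCH":
        state[path].update(value)          # partial update of a dict
    elif op == "TRUNCATE":
        del state[path]
    return state

state = {"answer": "Revenue grew"}
apply_delta(state, "APPEND", "answer", value=" 12% year-over-year")
print(state["answer"])  # Revenue grew 12% year-over-year
```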

Rich Content Streaming

Streamdown renders markdown progressively with custom plugins for:

  • Charts (progressive rendering)
  • Citations (linked to source documents)
  • Math equations (KaTeX)

Interactive Agent Workflows

The AskUserQuestion tool enables mid-workflow user input.
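
The article does not show the tool's schema; a hypothetical call shape for a valuation checkpoint:

```json
{
  "tool": "AskUserQuestion",
  "question": "Which discount rate should the DCF use?",
  "options": [
    {"label": "8.5%", "description": "Industry-benchmark WACC"},
    {"label": "10.0%", "description": "Conservative, adds a risk premium"}
  ]
}
```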

This transforms agents “from autonomous black boxes into collaborative tools” where users validate key assumptions in high-stakes financial work.

Evaluation System

Domain-Specific Test Categories

The article describes ~2,000 test cases across categories. Generic NLP metrics (BLEU, ROUGE) fail for finance because “a response can be semantically similar but have completely wrong numbers.”

Ticker Disambiguation:

| Input | Correct | Incorrect Alternative |
| --- | --- | --- |
| “Apple” | AAPL | APLE (Appel Petroleum) |
| “Meta” | META | MSTR |
| “Delta” | DAL | Delta hedging (options) |

Historical Ticker Changes:

  • Facebook → META (2021)
  • Google → GOOG/GOOGL restructure
  • Twitter → X

Queries about “Facebook stock in 2023” require understanding the FB → META mapping.

Fiscal Period Testing:

“Last quarter” on January 15th means different things:

| Company Type | “Last Quarter” Means |
| --- | --- |
| Calendar-year | Q4 2024 |
| Apple | Q1 2025 (just reported) |
| Microsoft | Q2 2025 (mid-quarter) |

200+ test cases cover period extraction.

Numeric Precision:

All equivalent: $4.2B, $4,200M, $4.2 billion, “four point two billion”

Fails: “4.2” without units (millions? billions? per share?)
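
A minimal sketch of the normalization an eval grader needs; the regex and unit table are simplifications (spelled-out numbers are not handled):

```python
import re

MULTIPLIERS = {"b": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6}

def parse_amount(text: str) -> int | None:
    """Normalize a dollar amount to raw dollars; None if ambiguous."""
    m = re.search(r"\$?\s*([\d.]+)\s*(billion|million|[bm])?\b",
                  text.lower().replace(",", ""))
    if not m or m.group(2) is None:
        return None  # no number, or no units: ambiguous, fail the case
    return round(float(m.group(1)) * MULTIPLIERS[m.group(2)])

assert parse_amount("$4.2B") == parse_amount("$4,200M") == parse_amount("4.2 billion")
assert parse_amount("4.2") is None  # units missing
```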

Adversarial Grounding:

Fake numbers are injected into context alongside real sources. If the agent cites the planted fake ($50B) instead of the real 10-K figure ($94B), the test fails. 50 test cases specifically target hallucination resistance.
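
A minimal sketch of one such case; the harness shape is an assumption, while the planted and real figures come from the article's example:

```python
PLANTED_FAKE = "$50B"   # injected into context alongside real sources
REAL_FIGURE = "$94B"    # the actual figure from the 10-K

def grounding_test(agent_response: str) -> bool:
    """Pass only if the agent cites the real figure, not the plant."""
    return REAL_FIGURE in agent_response and PLANTED_FAKE not in agent_response

assert grounding_test("Revenue was $94B per the 10-K.")
assert not grounding_test("Revenue was $50B.")
```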

Eval-Driven Development

  • Every skill has a companion eval
  • PRs are blocked if an eval score drops by more than 5%
  • The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity, and SBC add-backs

Production Monitoring

Observability Stack

Model Routing

| Query Complexity | Model | Rationale |
| --- | --- | --- |
| Simple | Haiku | Cost-effective |
| Complex | Sonnet | Higher quality |
| Enterprise | Best | Always premium |

Strategic Thesis

The Model Is Not the Product

The article concludes with a strategic framework:

“Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else.”

Durable Moats:

| Asset | Moat Characteristic |
| --- | --- |
| Financial data | Normalized decades of filings |
| Domain skills | Encoded expertise from analysts and customers |
| Real-time UX | Streaming, interactive workflows |
| User trust | Track record with professional investors |
| Domain knowledge | Time spent with customers understanding needs |

RAG to Agentic Search Transition

The author references a prior “RAG obituary” article and describes retiring embedding pipelines in favor of fully agentic search. This architectural shift was informed by discussions with Anthropic’s Claude Code team about a “filesystem-first agentic approach.”

The claim that “most startups are adopting these best practices” after initial skepticism suggests this represents an emerging consensus in agent architecture.

Key Findings

  • Sandboxed execution with user-isolated environments is mandatory for multi-step agent workflows executing arbitrary code
  • S3-first architecture with PostgreSQL sync provides superior durability, versioning, and cost characteristics for user data
  • Temporal workflows solve long-running task reliability with automatic retry and proper cancellation handling
  • Context normalization (converting heterogeneous financial data to clean markdown/CSV/JSON) represents the majority of engineering work
  • SEC filing parsing requires custom pipelines handling adversarial document characteristics; off-the-shelf parsers fail on edge cases
  • Markdown skills encode domain expertise and represent the durable product; models are commoditizing
  • Domain-specific evaluation (~2,000 test cases) catches errors that generic NLP metrics miss
  • The competitive moat lies in data, skills, UX, and domain expertise—not model access

References

  1. Nicolas Bustamante Twitter Thread - 2026-01-26
  2. Anthropic Agent Skills Specification - October 2025
  3. Temporal Workflow Engine
  4. Braintrust Evaluation Platform
  5. AWS ABAC Documentation