Fintool AI Agent Architecture: Lessons from Financial Services
Research Date: 2026-01-26
Source URL: https://x.com/nicbstme/status/2015174818497437834
Author: Nicolas Bustamante (@nicbstme), Fintool
Reference URLs
- Original Twitter/X Thread
- Anthropic Agent Skills Specification (referenced October 2025)
- Braintrust Eval Platform (referenced for evaluation)
- Temporal Workflow Engine (referenced for long-running tasks)
Summary
This article documents architectural decisions and operational lessons from two years of building Fintool, an AI agent platform serving professional investors. The author presents eleven core lessons spanning infrastructure (sandboxed execution, S3-first storage, Temporal workflows), data engineering (context normalization, adversarial document parsing), agent design (markdown skills, filesystem tools), and operations (domain-specific evaluation, production monitoring).
The central thesis posits that competitive advantage in AI agent products derives not from model access but from surrounding infrastructure: data quality, domain-encoded skills, user experience, and operational reliability. The article advocates for designing systems with planned obsolescence as model capabilities improve, while building durable moats in data and domain expertise.
The financial services context imposes stringent accuracy requirements—errors in revenue figures or valuation assumptions can result in significant financial losses for users making investment decisions based on agent output.
Domain Context: Financial Services Constraints
Error Intolerance
Financial services represents a high-stakes domain where agent errors carry concrete consequences:
| Error Type | Potential Impact |
|---|---|
| Incorrect revenue figure | Misinformed investment thesis |
| Misinterpreted guidance | Incorrect earnings expectations |
| Wrong DCF assumption | Flawed valuation model |
| Fiscal period confusion | Comparing non-comparable quarters |
The author notes that professional investors “spot bullshit instantly” and require “precision, speed, and depth.” This creates what the author describes as “paranoid attention to detail” where every number receives validation and every assumption undergoes stress testing.
User Sophistication
Target users are described as “some of the smartest, most time-pressed people” who cannot accept hand-waving through valuation models or glossed-over nuances. This user profile drives the technical requirements discussed throughout the article.
Infrastructure Architecture
Sandboxed Execution Environments
The article argues that sandboxing is non-optional for multi-step agent workflows. The author recounts an incident where an LLM attempted to execute `rm -rf /` while “trying to clean up temporary files.”
Architecture Pattern:
Key Implementation Details:
| Component | Implementation |
|---|---|
| Access Control | AWS ABAC with ${aws:PrincipalTag/S3Prefix} restrictions |
| Credential Scope | Short-lived, scoped to specific S3 prefixes |
| Sandbox Lifecycle | 600-second timeout, extended 10 minutes per tool usage |
| Pre-warming | Sandbox initialization begins when user starts typing |
The three-tier mount system (private/shared/public) enables organizational data sharing while maintaining user isolation. IAM policies physically prevent cross-user data access.
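The prefix restriction can be sketched as an IAM policy template. This is a minimal illustration under stated assumptions, not Fintool's actual policy; the bucket name and action list are invented for the example:

```python
import json

def user_scoped_policy(bucket: str) -> dict:
    """Illustrative ABAC policy: S3 access limited to the caller's tagged prefix.

    The ${aws:PrincipalTag/S3Prefix} variable resolves at request time to the
    session tag attached when the sandbox assumes its short-lived role, so the
    same policy document isolates every user without per-user policies.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/${{aws:PrincipalTag/S3Prefix}}/*",
            },
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringLike": {"s3:prefix": "${aws:PrincipalTag/S3Prefix}/*"}
                },
            },
        ],
    }

print(json.dumps(user_scoped_policy("fintool-user-data"), indent=2))
```

Because the prefix comes from a principal tag rather than from the policy text, a sandbox holding these credentials cannot name another user's prefix at all, which is what "physically prevent" means here.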
S3-First Architecture
The article advocates for S3 as the primary data store over traditional databases for user data (watchlists, portfolios, preferences, memories, skills):
Rationale:
| Factor | S3 Advantage |
|---|---|
| Durability | 11 nines (99.999999999%) |
| Versioning | Built-in audit trails |
| Simplicity | YAML files are human-readable, debuggable |
| Cost | Lower than equivalent database storage |
Sync Architecture:
- Real-time: Lambda triggered by S3 events via SNS performs upsert/delete
- Reconciliation: EventBridge-scheduled Lambda (every 3 hours) performs full S3-to-DB scan
- Conflict resolution: Timestamp guards ensure newer data wins
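The timestamp guard can be sketched as follows, with an in-memory dict standing in for the Postgres table the sync writes into (all names are hypothetical):

```python
from datetime import datetime, timezone

# In-memory stand-in for the database table targeted by the S3 sync.
db: dict[str, dict] = {}

def upsert_if_newer(key: str, payload: dict, modified_at: datetime) -> bool:
    """Apply an S3 event only if it is newer than what the DB already holds.

    This is the timestamp guard: a delayed or replayed event carrying an
    older object version is dropped, so the newest write always wins.
    """
    current = db.get(key)
    if current and current["modified_at"] >= modified_at:
        return False  # stale event, ignore
    db[key] = {"payload": payload, "modified_at": modified_at}
    return True
```

The same guard lets the 3-hour reconciliation scan run concurrently with the real-time Lambda: whichever path carries the fresher object timestamp wins, and the other is a no-op.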
User memories are stored as markdown files (`/private/memories/UserMemories.md`) that users can edit directly. These are injected as context on every conversation:

```
<user-memories>
{user_memories}
</user-memories>
```
Temporal for Long-Running Tasks
The article describes Temporal as transformative for handling multi-minute agent workflows:
Problem Context:
- Company analysis tasks may require 5+ minutes
- Server restarts, tab closures, and network issues interrupt homegrown job queues
- State management and retry consistency were problematic
Temporal Benefits:
| Capability | Implementation |
|---|---|
| Automatic Retry | Worker crash triggers automatic retry on another worker |
| State Persistence | Workflow state survives infrastructure failures |
| Cancellation | Heartbeat-based cancellation handling |
Worker Configuration:
| Worker Type | Concurrent Activities | Purpose |
|---|---|---|
| Chat | 25 | User-facing requests |
| Background | 10 | Async tasks |
Cancellation requires heartbeats sent every few seconds, with the activity checking cancellation status between heartbeats.
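The heartbeat-and-cancel pattern can be sketched in plain Python. In Temporal itself the heartbeat call would be `activity.heartbeat()` and cancellation arrives as an exception raised inside the activity; this stand-in shows only the control flow:

```python
import threading
import time

def run_with_heartbeat(work_steps, heartbeat, cancelled: threading.Event,
                       interval: float = 2.0) -> list:
    """Run work units, emitting a heartbeat every few seconds and checking
    for cancellation between units (never mid-unit).

    work_steps: callables representing indivisible units of work.
    heartbeat:  callable invoked at most once per `interval` seconds.
    cancelled:  external cancellation signal, checked between steps.
    """
    results, last_beat = [], 0.0
    for step in work_steps:
        if cancelled.is_set():
            break  # stop cleanly between steps
        now = time.monotonic()
        if now - last_beat >= interval:
            heartbeat()
            last_beat = now
        results.append(step())
    return results
```

The key property mirrors the article's description: without periodic heartbeats the server has no way to deliver cancellation, so long-running activities must structure themselves as heartbeat-punctuated loops.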
Data Engineering: Context as Product
The Normalization Challenge
The article asserts that “your agent is only as good as the context it can access” and that the “real work isn’t prompt engineering—it’s turning messy financial data from dozens of sources into clean, structured context.”
Source Heterogeneity:
| Source Type | Format Characteristics |
|---|---|
| SEC Filings | HTML with nested tables, exhibits, signatures |
| Earnings Transcripts | Speaker-segmented text with Q&A sections |
| Press Releases | Semi-structured HTML from PRNewswire |
| Research Reports | PDFs with charts and footnotes |
| Market Data | Structured numerical data from Snowflake/databases |
| News | Articles with varying quality and structure |
| Alternative Data | Satellite imagery, web traffic, credit card panels |
| Broker Research | Proprietary PDFs with price targets and models |
| Fund Filings | 13F holdings, proxy statements, activist letters |
Normalization Output:
All sources are converted to three formats:
- Markdown: Narrative content (filings, transcripts, articles)
- CSV/Tables: Structured data (financials, metrics, comparisons)
- JSON Metadata: Searchability attributes (tickers, dates, document types, fiscal periods)
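Under these conventions, one normalized document might look like the following record. Field names and dummy figures are illustrative assumptions, not Fintool's actual schema:

```python
from dataclasses import dataclass

@dataclass
class NormalizedDoc:
    """One source document after normalization into the three formats."""
    doc_id: str
    markdown: str               # narrative content (filing text, transcript)
    tables_csv: dict[str, str]  # named tables as CSV text
    meta: dict                  # searchability attributes for retrieval

doc = NormalizedDoc(
    doc_id="ACME-10K-2024",
    markdown="## Item 7. MD&A\nRevenue grew on higher unit volume...",
    tables_csv={"income_statement": "line_item,fy2024,fy2023\nrevenue,100,90\n"},
    meta={"ticker": "ACME", "doc_type": "10-K", "fiscal_year": 2024},
)
```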
Chunking Strategy
Different document types require different chunking approaches:
| Document Type | Chunking Strategy |
|---|---|
| 10-K Filings | By regulatory section (Item 1, 1A, 7, 8…) |
| Earnings Transcripts | By speaker turn (CEO, CFO, Q&A by analyst) |
| Press Releases | Single chunk (typically small enough) |
| News Articles | Paragraph-level chunks |
| 13F Filings | By holder and position changes quarter-over-quarter |
The article notes that “chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.”
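Speaker-turn chunking for transcripts can be sketched as follows, assuming (as a simplification) that the normalized transcript marks each speaker as `Name (Role):` at the start of a line:

```python
import re

def chunk_transcript(text: str) -> list[dict]:
    """Split an earnings-call transcript into one chunk per speaker turn."""
    pattern = re.compile(r"^(?P<speaker>[^:\n]+ \([^)]+\)):", re.MULTILINE)
    matches = list(pattern.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        # A turn runs from the end of this speaker marker to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "speaker": m.group("speaker"),
            "text": text[m.end():end].strip(),
        })
    return chunks
```

Each chunk carries its speaker, so retrieval can distinguish a CFO's prepared remarks from an analyst's question in the Q&A, which is exactly the context boundary the table above calls for.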
Table Handling
LLMs demonstrate strong reasoning capability over markdown tables but perform poorly on raw HTML `<table>` markup or CSV dumps. The normalization layer converts all tabular data to clean markdown format.
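A minimal converter for flat, well-formed tables illustrates the idea; production parsing of filing tables (colspans, nested tables, footnote markers) needs far more than this sketch:

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect cells from a simple HTML table: flat <tr>/<th>/<td> only."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append("".join(self.cell).strip())
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def html_table_to_markdown(html: str) -> str:
    """Render the first row as the markdown header, the rest as body rows."""
    p = TableToMarkdown()
    p.feed(html)
    header, *body = p.rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```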
Metadata-Enabled Retrieval
Every document includes a `meta.json` file with structured metadata that enables filtered retrieval.
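An illustrative `meta.json`, expressed here as a Python dict; the field names are assumptions derived from the searchability attributes listed above (tickers, dates, document types, fiscal periods), not Fintool's actual schema:

```python
import json

# Hypothetical metadata record for one normalized SEC filing.
meta = {
    "ticker": "ACME",
    "doc_type": "10-K",
    "fiscal_year": 2024,
    "fiscal_period": "FY",
    "period_start": "2023-10-01",   # absolute dates, post fiscal normalization
    "period_end": "2024-09-30",
    "source": "SEC EDGAR",
    "sections": ["Item 1", "Item 1A", "Item 7", "Item 8"],
}
print(json.dumps(meta, indent=2))
```

Filtering on fields like `ticker` and `fiscal_period` before reading any document body is what keeps retrieval cheap when the corpus spans decades of filings.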
The Parsing Problem
SEC Filing Adversarial Characteristics
SEC filings are characterized as “not designed for machine reading” but “designed for legal compliance”:
| Challenge | Description |
|---|---|
| Multi-page tables | Tables span pages with repeated headers |
| Circular references | Footnotes reference exhibits referencing footnotes |
| Inconsistent numbers | Same figures appear differently in text vs. tables |
| Unreliable XBRL | Tags often wrong or incomplete |
| Format variance | Each law firm uses different templates |
Off-the-shelf parser failures:
- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)
Parsing Pipeline
Table Extraction Complexity
Financial tables contain dense semantic information:
| Table Element | Challenge |
|---|---|
| Merged headers | Cells spanning multiple columns |
| Footnote markers | (1), (2), (a), (b) referencing explanations below |
| Negative notation | $(1,234) means -1234 |
| Mixed units | Millions for revenue, percentages for margins |
| Restatements | Prior period adjustments in italics or with asterisks |
Quality Scoring Dimensions:
- Cell boundary accuracy (split/merge correctness)
- Header detection (row 1 vs. title row above)
- Numeric parsing (text vs. parsed number)
- Unit inference (millions, billions, per share, percentage)
Tables scoring below 90% confidence are flagged for review and excluded from agent context.
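The negative-notation and currency rules from the table above can be sketched as a cell parser. This is a simplification: unit inference, footnote markers, and restatement styling are not handled:

```python
def parse_financial_cell(cell: str):
    """Parse one table cell: strip $ and thousands separators, treat
    parentheses as negative per accounting convention, and return None
    for non-numeric text such as 'n/a'."""
    s = cell.strip().replace("$", "").replace(",", "")
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    try:
        value = float(s)
    except ValueError:
        return None
    return -value if negative else value
```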
Fiscal Period Normalization
“Q1 2024” is ambiguous without company context:
| Company | Q1 2024 Means | Fiscal Year End |
|---|---|---|
| Calendar | January-March 2024 | December |
| Apple | October-December 2023 | September |
| Microsoft | July-September 2023 | June |
The system maintains fiscal calendars for 10,000+ companies, normalizing all date references to absolute date ranges. Without this normalization, comparisons would conflate non-comparable periods.
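The normalization can be sketched as follows, assuming month-aligned fiscal quarters labeled by the calendar year in which the fiscal year ends. Real fiscal calendars (including Apple's 52/53-week years ending on a Saturday) require company-specific data, which is why the system maintains them explicitly:

```python
import calendar
from datetime import date

def fiscal_quarter_range(fiscal_year: int, quarter: int, fye_month: int):
    """Resolve e.g. 'Q1 FY2024' to an absolute (start, end) date range,
    given the month the company's fiscal year ends (12 = calendar filer)."""
    first_month = fye_month % 12 + 1               # fiscal year starts the month after FYE
    start_year = fiscal_year if fye_month == 12 else fiscal_year - 1
    m = first_month + (quarter - 1) * 3            # quarter's first month, may exceed 12
    start_year += (m - 1) // 12
    m = (m - 1) % 12 + 1
    end_m = m + 2                                  # quarters span three months
    end_year = start_year + (end_m - 1) // 12
    end_m = (end_m - 1) % 12 + 1
    last_day = calendar.monthrange(end_year, end_m)[1]
    return date(start_year, m, 1), date(end_year, end_m, last_day)
```

This reproduces the table above: Apple (FYE September) maps Q1 2024 to October-December 2023, Microsoft (FYE June) to July-September 2023, and a calendar filer to January-March 2024.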
Agent Design: Skills Architecture
Skills as First-Class Citizens
The article’s central agent design thesis: “the model is not the product. The skills are now the product.”
Without skills, frontier models “know what DCF is” and “can explain the theory” but produce subtly incorrect output when executing: missing critical steps, using wrong discount rates, forgetting stock-based compensation add-backs, skipping sensitivity analysis.
Skill Structure (Anthropic Agent Skills Specification, October 2025):
A skill is a folder containing SKILL.md with YAML frontmatter plus supporting files:
```markdown
# dcf

## When to Use
Use this skill for discounted cash flow valuations.

## Instructions
1. Deep dive on the company using Task tool
2. Identify industry and load industry-specific guidelines
3. Gather financial data: revenue, margins, CapEx, working capital
4. Build DCF model in Excel using xlsx skill
5. Calculate WACC using industry benchmarks
6. Run sensitivity analysis on WACC and terminal growth
7. Validate: reconcile to actuals, compare to market price
8. Document view vs market pricing

## Industry Guidelines
- Technology/SaaS: `/public/skills/dcf/guidelines/technology-saas.md`
- Healthcare/Pharma: `/public/skills/dcf/guidelines/healthcare-pharma-biotech.md`
- Financial Services: `/public/skills/dcf/guidelines/financial-services.md`
[... 10+ industries]
```
Skills vs. Code
| Advantage | Explanation |
|---|---|
| Non-engineer authorship | Analysts and customers can create skills |
| No deployment | File changes take effect immediately |
| Auditability | Human-readable intent vs. opaque code |
Shadowing System
Priority resolution: private > shared > public
Users can override default skills by placing custom versions in `/private/skills/`. The user’s version wins.
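The private > shared > public resolution can be sketched as a first-match lookup; the dict below is an in-memory stand-in for the SQL metadata query described in the next section:

```python
def resolve_skill(name: str, tiers: dict) -> str:
    """Return the winning copy of a skill under shadowing rules.

    `tiers` maps tier name -> set of skill names present at that tier.
    The first tier in priority order that contains the skill wins, so a
    private copy shadows shared and public copies of the same name.
    """
    for tier in ("private", "shared", "public"):
        if name in tiers.get(tier, set()):
            return f"/{tier}/skills/{name}/SKILL.md"
    return None
```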
SQL-Based Skill Discovery
Skills are not mounted directly to the filesystem. Instead, metadata is queried from PostgreSQL:
```sql
SELECT user_id, path, metadata
FROM fs_files
WHERE user_id = ANY(:user_ids)
  AND path LIKE 'skills/%/SKILL.md'
```
Rationale:
| Concern | SQL Solution |
|---|---|
| Token efficiency | Lazy loading—full docs loaded only when skill invoked |
| Access control | Three-tier access enforced at query time |
| Shadowing logic | Priority resolution via SQL vs. filesystem symlinks |
| Metadata filtering | Query YAML frontmatter without reading files |
The author emphasizes: “Top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills.”
Planned Obsolescence
The article advises designing skills “knowing that future models will need less hand-holding.” Detailed step-by-step instructions needed today may become one-liners as model capabilities improve.
Strategy:
- Write skills for current limitations
- Delete skills when they become unnecessary
- Build new skills for emerging harder problems
- Prefer markdown over code for easier modification and deletion
The author predicts: “in two years, most of our basic skills will be one-liners.”
Filesystem Tools
Core Tool Set
| Tool | Purpose |
|---|---|
| ReadFile | Handle complexity of various file formats |
| WriteFile | Create artifacts linked back to UI |
| Bash | Persistent shell access (180s timeout, 100K char limit) |
Files written to `/private/artifacts/` become clickable links in the UI via the `computer://user_id/artifacts/` protocol.
Bash as Exploration Tool
The article references Braintrust evaluation comparing SQL agents, bash agents, and hybrid approaches:
| Approach | Accuracy | Trade-off |
|---|---|---|
| Pure SQL | 100% | Fast on structured queries, but missed edge cases |
| Pure Bash | Lower | Slower and more expensive, but caught errors during verification |
| Hybrid | Best | Bash for exploration, SQL for structured |
The author advocates “full shell access in the sandbox” for exploration, verification, and ad-hoc data manipulation that complex financial tasks require.
Real-Time Streaming
Delta Updates Architecture
Delta Operations:
| Operation | Purpose |
|---|---|
| ADD | Insert object at index |
| APPEND | Append to string/array |
| REPLACE | Replace content |
| PATCH | Partial update |
| TRUNCATE | Remove content |
Delta updates send “append these 50 characters” rather than “here’s the complete response so far” for efficiency.
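Client-side application of deltas can be sketched for the string-valued cases; ADD and PATCH on nested objects are omitted, and the operation semantics here are assumptions based on the table above:

```python
def apply_delta(state: dict, op: str, path: str, value=None) -> dict:
    """Apply one streaming delta to the client-side message state.

    APPEND concatenates a small fragment ("append these 50 characters")
    instead of resending the whole response, which is the efficiency
    win described above.
    """
    if op == "APPEND":
        state[path] = state.get(path, "") + value
    elif op == "REPLACE":
        state[path] = value
    elif op == "TRUNCATE":
        state.pop(path, None)
    else:
        raise ValueError(f"unsupported op: {op}")
    return state
```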
Rich Content Streaming
Streamdown renders markdown progressively with custom plugins for:
- Charts (progressive rendering)
- Citations (linked to source documents)
- Math equations (KaTeX)
Interactive Agent Workflows
The AskUserQuestion tool enables mid-workflow user input.
This transforms agents “from autonomous black boxes into collaborative tools” where users validate key assumptions in high-stakes financial work.
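An illustrative payload for such a question; the field names are assumptions for the sketch, not the tool's actual schema:

```python
# Hypothetical AskUserQuestion payload surfaced mid-DCF so the user can
# validate a key assumption before the agent continues.
question = {
    "question": "Which discount rate should the DCF use?",
    "options": [
        {"label": "8.5% (industry-benchmark WACC)", "default": True},
        {"label": "10% (conservative)", "default": False},
    ],
    "allow_free_text": True,
}
```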
Evaluation System
Domain-Specific Test Categories
The article describes ~2,000 test cases across categories. Generic NLP metrics (BLEU, ROUGE) fail for finance because “a response can be semantically similar but have completely wrong numbers.”
Ticker Disambiguation:
| Input | Correct | Incorrect Alternative |
|---|---|---|
| “Apple” | AAPL | APLE (Apple Hospitality REIT) |
| “Meta” | META | MSTR |
| “Delta” | DAL | Delta hedging (options) |
Historical Ticker Changes:
- Facebook → META (2021)
- Google → GOOG/GOOGL restructure
- Twitter → X
Queries about “Facebook stock in 2023” require understanding FB → META mapping.
Fiscal Period Testing:
“Last quarter” on January 15th means different things:
| Company Type | “Last Quarter” Means |
|---|---|
| Calendar-year | Q4 2024 |
| Apple | Q1 2025 (fiscal quarter just ended) |
| Microsoft | Q2 2025 (mid-quarter) |
200+ test cases cover period extraction.
Numeric Precision:
All equivalent: $4.2B, $4,200M, $4.2 billion, “four point two billion”
Fails: “4.2” without units (millions? billions? per share?)
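The unit-normalization rule can be sketched as follows. Spelled-out forms like “four point two billion” would need additional handling; the regex and unit map are assumptions for the sketch:

```python
import re

UNITS = {"b": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6, "k": 1e3}

def normalize_amount(text: str):
    """Normalize '$4.2B', '$4,200M', '$4.2 billion' to a single float.

    Returns None when no unit is present, matching the eval rule that a
    bare '4.2' is ambiguous and must fail.
    """
    m = re.fullmatch(r"\$?\s*([\d,]+(?:\.\d+)?)\s*(b|billion|m|million|k)?",
                     text.strip(), re.IGNORECASE)
    if not m or not m.group(2):
        return None
    return float(m.group(1).replace(",", "")) * UNITS[m.group(2).lower()]
```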
Adversarial Grounding:
Fake numbers are injected into context alongside real sources. If the agent cites the planted fake ($50B) instead of the real 10-K figure ($94B), the test fails. 50 test cases specifically target hallucination resistance.
Eval-Driven Development
- Every skill has companion eval
- PR blocked if eval score drops >5%
- DCF skill has 40 test cases covering WACC edge cases, terminal value sanity, SBC add-backs
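The regression gate can be sketched as a one-line check; whether the article's 5% threshold is relative or absolute is an assumption here (relative is used):

```python
def eval_gate(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True if the PR may merge: the skill's eval score has not
    dropped more than `max_drop` (5%) relative to the baseline."""
    return current >= baseline * (1 - max_drop)
```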
Production Monitoring
Observability Stack
Model Routing
| Query Complexity | Model | Rationale |
|---|---|---|
| Simple | Haiku | Cost-effective |
| Complex | Sonnet | Higher quality |
| Enterprise | Best | Always premium |
Strategic Thesis
The Model Is Not the Product
The article concludes with a strategic framework:
“Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else.”
Durable Moats:
| Asset | Moat Characteristic |
|---|---|
| Financial data | Normalized decades of filings |
| Domain skills | Encoded expertise from analysts and customers |
| Real-time UX | Streaming, interactive workflows |
| User trust | Track record with professional investors |
| Domain knowledge | Time spent with customers understanding needs |
RAG to Agentic Search Transition
The author references a prior “RAG obituary” article and describes retiring embedding pipelines in favor of fully agentic search. This architectural shift was informed by discussions with Anthropic’s Claude Code team about “filesystem-first agentic approach.”
The author’s claim that, after initial skepticism, “most startups are adopting these best practices” suggests these patterns represent an emerging consensus in agent architecture.
Key Findings
- Sandboxed execution with user-isolated environments is mandatory for multi-step agent workflows executing arbitrary code
- S3-first architecture with PostgreSQL sync provides superior durability, versioning, and cost characteristics for user data
- Temporal workflows solve long-running task reliability with automatic retry and proper cancellation handling
- Context normalization (converting heterogeneous financial data to clean markdown/CSV/JSON) represents the majority of engineering work
- SEC filing parsing requires custom pipelines handling adversarial document characteristics; off-the-shelf parsers fail on edge cases
- Markdown skills encode domain expertise and represent the durable product; models are commoditizing
- Domain-specific evaluation (~2,000 test cases) catches errors that generic NLP metrics miss
- The competitive moat lies in data, skills, UX, and domain expertise—not model access