We built a trading agent because it was the most honest way to test whether our AI agents could make real, consequential decisions in a domain where feedback is immediate and objective. You can't hallucinate your way to a profit.
Here's the full technical story: what we started with, what broke, what we rebuilt, and what we actually learned about AI decision-making under uncertainty. We're sharing it because, when we were building this, we found almost nothing written honestly about the failure modes.
What We Started With: Version 1
The first version was embarrassingly naive in hindsight. We gave a Claude agent access to price data via the Binance API, a set of technical indicators, and told it to decide when to trade. The prompt engineering was careful. The decision loop looked reasonable on paper. The agent was confident and eloquent about its reasoning.
It lost money consistently for six weeks.
Not dramatically; more like a slow drain. The agent was making decisions that were defensible in isolation but failed to account for how much noise exists in short-term price data. It was pattern-matching to things that looked meaningful in the prompt but weren't statistically significant in the data.
LLMs are excellent reasoners about qualitative patterns. They are not statistical models. Asking an LLM to find signal in raw OHLCV data is asking a carpenter to do surgery. Use the right tool for each layer of the problem.
The Architecture Rebuild: V2
We stopped thinking of it as "an AI that trades" and started thinking of it as "an AI that orchestrates a trading pipeline." The redesign separated concerns properly:
Layer 1: Signal Generation (statistical, not LLM)
Traditional quantitative indicators (RSI, MACD, Bollinger Bands, volume-weighted average price) computed by deterministic Python functions. No hallucination possible here. Numbers in, numbers out.
```python
import pandas as pd
import pandas_ta as ta

def compute_signals(df: pd.DataFrame) -> dict:
    """Deterministic signal computation: no LLM involved."""
    signals = {}
    signals['rsi'] = ta.rsi(df['close'], length=14).iloc[-1]
    signals['macd_hist'] = ta.macd(df['close'])['MACDh_12_26_9'].iloc[-1]
    signals['bb_pct'] = ta.bbands(df['close'])['BBP_5_2.0'].iloc[-1]
    signals['vol_ratio'] = df['volume'].iloc[-1] / df['volume'].rolling(20).mean().iloc[-1]
    return signals
```
Layer 2: Context Aggregation (LLM strengths)
The AI agent's actual job: read market news, assess macro context, evaluate sentiment from relevant sources, and produce a confidence-weighted qualitative assessment of the current environment. This is what LLMs are genuinely good at.
Layer 3: Decision Logic (rule-based, not LLM)
Hard rules combining quantitative signals and the agent's qualitative assessment. No "the agent decides to trade." Instead: if quantitative signals align AND qualitative assessment is non-negative AND position limits allow, then execute. The LLM can veto or flag risk; it cannot override explicit position limit rules.
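A minimal sketch of what that rule layer can look like. The `QualAssessment` shape, the RSI/MACD thresholds, and the dollar limit are illustrative assumptions, not our production values; the point is only that the LLM output enters as one input among several and the hard limit is checked outside its reach.

```python
from dataclasses import dataclass

MAX_POSITION_USD = 1_000  # hard external limit; the LLM cannot raise this

@dataclass
class QualAssessment:
    sentiment: str  # 'negative', 'neutral', or 'positive'
    veto: bool      # the LLM may veto a trade; it can never force one

def decide(signals: dict, qual: QualAssessment, open_position_usd: float) -> str:
    """Deterministic decision logic: the LLM assessment is one input, not the judge."""
    if qual.veto:
        return 'SKIP'                        # LLM can veto...
    if open_position_usd >= MAX_POSITION_USD:
        return 'SKIP'                        # ...but never override hard limits
    quant_aligned = signals['rsi'] < 30 and signals['macd_hist'] > 0
    if quant_aligned and qual.sentiment != 'negative':
        return 'BUY'
    return 'SKIP'
```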
Layer 4: Execution (exchange API)
Clean API calls to the exchange. Order placement, position tracking, stop-loss enforcement. Deterministic, logged, auditable.
The right role for an LLM in a trading system is context interpreter and risk narrator, not decision maker. The architecture where the LLM writes natural language risk assessments that feed into deterministic rules outperformed the architecture where the LLM made the final call.
What Actually Worked
The news context aggregation was a genuine improvement. The agent is better at flagging "this feels like a high-uncertainty macro period, reduce position sizes" than any indicator-only system we've used. That qualitative risk awareness, translated into a parameter adjustment rather than a direct trade decision, turned out to be the most valuable part of the LLM integration.
Sentiment monitoring also works well. Tracking relevant Telegram channels, Twitter signals, and news aggregators for tone shifts around specific assets (summarized by the agent, scored by a classifier, fed into position sizing) meaningfully improved timing.
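One way that "fed into position sizing" step can work, sketched under assumed numbers: the classifier score and the agent's uncertainty flag only move a bounded multiplier, so sentiment can shrink exposure but never amplify it past the baseline. The thresholds and scale here are illustrative, not our production values.

```python
def size_multiplier(sentiment_score: float, uncertainty_flag: bool) -> float:
    """Map a classifier sentiment score in [-1, 1] and the agent's macro
    uncertainty flag to a position-size multiplier in (0, 1].
    The LLM never picks a size directly; it only moves this bounded dial."""
    base = 0.5 + 0.5 * max(sentiment_score, 0.0)  # 0.5..1.0, never above 1
    if sentiment_score < -0.3:
        base = 0.25                               # strongly negative tone: cut hard
    if uncertainty_flag:
        base *= 0.5                               # high-uncertainty macro period
    return round(base, 2)
```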
What Failed (Be Honest)
A few things that didn't work despite sounding good:
- Agent "reasoning" about price targets: Give an LLM a chart and ask it for a price target. It'll give you a confident answer that means nothing. We removed this completely.
- Long-horizon predictions: "What happens to this asset over the next week?" The answer is always plausible-sounding noise. Day-trading signals with short feedback loops are better suited to the architecture.
- Self-learning without guardrails: We tried having the agent update its own parameters based on trade outcomes. It developed strange emergent behaviors, notably a preference for very small position sizes that technically reduced losses but also eliminated any upside. Classic mode collapse. We now reset to a baseline configuration weekly.
- Running without position limits: This one's obvious but worth saying. The first month without hard position limits was the worst month.
An AI agent that can modify its own decision parameters without external constraints will optimize for the wrong objective. "Reduce losses" sounds like what you want until you realize the agent's answer is "never trade." Hard external constraints are not optional.
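Those external constraints can be as simple as clamping every self-proposed parameter update into fixed bounds. A sketch, with hypothetical parameter names and bound values: because `position_frac` can never fall below its floor, the agent's "never trade" attractor is unreachable by construction.

```python
# Hard bounds the agent's self-tuning can never escape (illustrative values)
PARAM_BOUNDS = {
    'position_frac': (0.05, 0.25),  # floor > 0, so "never trade" is unreachable
    'stop_loss_pct': (0.01, 0.05),
}

def apply_update(params: dict, proposed: dict) -> dict:
    """Accept the agent's proposed parameter changes only inside hard bounds."""
    out = dict(params)
    for key, value in proposed.items():
        lo, hi = PARAM_BOUNDS[key]
        out[key] = min(max(value, lo), hi)  # clamp; don't trust the agent
    return out
```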
The Infrastructure Stack
The agent runs 24/7 on a VPS, which is non-negotiable for any trading system. Cloud functions with timeout limits are incompatible with a system that needs persistent websocket connections to exchange APIs and immediate response to market events.
Monitoring is critical
A trading agent that goes silent at 2am is not a minor inconvenience; it's a position sitting unmonitored in a volatile market. Our monitoring setup: heartbeat file written every 5 minutes, watchdog checks it, Telegram alert if it goes stale for 10 minutes. We also log every trade decision (and every skipped trade) with full reasoning to a Supabase table. When something weird happens, you can trace exactly what the agent saw and decided.
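The heartbeat half of that setup fits in a few lines. This is a sketch under assumptions: the file path is illustrative, and the Telegram alert (not shown) would fire from the watchdog whenever `is_stale()` returns True.

```python
import time
from pathlib import Path

HEARTBEAT = Path('agent.heartbeat')  # illustrative path
STALE_AFTER = 10 * 60                # seconds before the watchdog alerts

def write_heartbeat() -> None:
    """Called from the agent's main loop every 5 minutes."""
    HEARTBEAT.write_text(str(time.time()))

def is_stale(now=None) -> bool:
    """Watchdog check: True if the heartbeat is missing or older than 10 minutes."""
    if not HEARTBEAT.exists():
        return True
    now = time.time() if now is None else now
    return now - float(HEARTBEAT.read_text()) > STALE_AFTER
```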
```python
# Every decision gets logged, not just trades
await db.insert('agent_decisions', {
    'timestamp': datetime.utcnow(),
    'signals': signals_dict,
    'agent_assessment': qualitative_text,
    'decision': decision,       # one of 'SKIP', 'BUY', 'SELL'
    'reason': one_line_reason,
    'market_regime': regime,    # one of 'trending', 'ranging', 'volatile'
})
```
Where We Are Now
The V2 architecture is more conservative than V1. It trades less, which initially felt like failure. It turns out trading less, in this particular domain, is a feature, not a bug: selectivity outperforms frequency for most retail-scale automated systems.
We're working on V3, which adds proper backtesting infrastructure. One of the lessons we learned too late: you can't tell if a system is genuinely good or just lucky without running it against historical data at scale. We built the live system before building the backtester. Do the opposite.
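To make "good versus lucky" concrete, even a toy backtester illustrates the shape of the check: replay a decision stream against historical prices and measure the return. This sketch is deliberately minimal and deliberately not our V3 design; a real backtester must also model fees, slippage, and partial fills.

```python
def backtest(closes: list[float], decisions: list[str]) -> float:
    """Toy all-in/all-out backtest: apply BUY/SELL/SKIP decisions to a
    historical close series and return total return on 1.0 starting capital.
    Ignores fees and slippage, which a real backtester must model."""
    cash, units = 1.0, 0.0
    for price, decision in zip(closes, decisions):
        if decision == 'BUY' and cash > 0:
            units, cash = cash / price, 0.0
        elif decision == 'SELL' and units > 0:
            cash, units = units * price, 0.0
    final = cash + units * closes[-1]  # mark any open position to the last close
    return final - 1.0
```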
The honest summary: AI agents can make a trading system meaningfully better at context processing and risk narration. They cannot substitute for sound quantitative foundations. The interesting architecture is one where each layer does what it's actually good at.
If you want to dig into the infrastructure side of running agents like this, the VPS hosting guide covers that in detail.