We built a trading agent because it was the most honest way to test whether our AI agents could make real, consequential decisions in a domain where feedback is immediate and objective. You can't hallucinate your way to a profit.

Here's the full technical story: what we started with, what broke, what we rebuilt, and what we actually learned about AI decision-making under uncertainty. We're sharing this because, when we were building it, we found almost nothing written honestly about these failure modes.

What We Started With: Version 1

The first version was embarrassingly naive in hindsight. We gave a Claude agent access to price data via the Binance API and a set of technical indicators, then told it to decide when to trade. The prompt engineering was careful. The decision loop looked reasonable on paper. The agent was confident and eloquent about its reasoning.

It lost money consistently for six weeks.

Not dramatically, more like a slow drain. The agent was making decisions that were defensible in isolation but failed to account for how much noise exists in short-term price data. It was pattern-matching to things that looked meaningful in the prompt but weren't statistically significant in the data.

Lesson 1

LLMs are excellent reasoners about qualitative patterns. They are not statistical models. Asking an LLM to find signal in raw OHLCV data is asking a carpenter to do surgery. Use the right tool for each layer of the problem.

The Architecture Rebuild: V2

We stopped thinking of it as "an AI that trades" and started thinking of it as "an AI that orchestrates a trading pipeline." The redesign separated concerns properly:

Layer 1: Signal Generation (statistical, not LLM)

Traditional quantitative indicators (RSI, MACD, Bollinger Bands, volume-weighted average price) computed by deterministic Python functions. No hallucination possible here. Numbers in, numbers out.

```python
import pandas as pd
import pandas_ta as ta

def compute_signals(df: pd.DataFrame) -> dict:
    """Deterministic signal computation: no LLM involved."""
    signals = {}
    signals['rsi'] = ta.rsi(df['close'], length=14).iloc[-1]
    signals['macd_hist'] = ta.macd(df['close'])['MACDh_12_26_9'].iloc[-1]
    signals['bb_pct'] = ta.bbands(df['close'])['BBP_5_2.0'].iloc[-1]
    signals['vol_ratio'] = df['volume'].iloc[-1] / df['volume'].rolling(20).mean().iloc[-1]
    return signals
```

Layer 2: Context Aggregation (LLM strengths)

The AI agent's actual job: read market news, assess macro context, evaluate sentiment from relevant sources, and produce a confidence-weighted qualitative assessment of the current environment. This is what LLMs are genuinely good at.
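One practical detail: the assessment stays useful downstream only if it is machine-consumable. A sketch of the defensive parsing we mean, asking the agent for structured JSON and validating it before it touches the pipeline (all names and the schema here are illustrative, not our production code):

```python
import json

ALLOWED_SENTIMENT = {"negative", "neutral", "positive"}

def parse_assessment(raw: str) -> dict:
    """Validate the LLM's JSON assessment; fall back to a safe default."""
    try:
        data = json.loads(raw)
        sentiment = data["sentiment"]
        confidence = float(data["confidence"])
        if sentiment not in ALLOWED_SENTIMENT or not 0.0 <= confidence <= 1.0:
            raise ValueError("out of range")
        return {"sentiment": sentiment, "confidence": confidence,
                "summary": str(data.get("summary", ""))[:280]}
    except (ValueError, KeyError, TypeError):
        # Malformed output is treated as maximum uncertainty, never a buy signal.
        return {"sentiment": "neutral", "confidence": 0.0, "summary": "unparseable"}
```

The fallback matters: an agent that emits garbage should degrade to "do nothing", not to an exception in the trading loop.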

Layer 3: Decision Logic (rule-based, not LLM)

Hard rules combining quantitative signals and the agent's qualitative assessment. No "the agent decides to trade." Instead: if quantitative signals align AND the qualitative assessment is non-negative AND position limits allow, then execute. The LLM can veto or flag risk; it cannot override explicit position limit rules.
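A minimal sketch of that gate; the thresholds are illustrative, and the real rule set is larger:

```python
def decide(signals: dict, assessment: dict, open_positions: int,
           max_positions: int = 3) -> str:
    """Rule-based gate: the LLM can veto a trade, never bypass limits."""
    # Hard constraint first: no qualitative input can override it.
    if open_positions >= max_positions:
        return "SKIP"
    # Quantitative alignment (thresholds here are illustrative).
    quant_ok = signals["rsi"] < 35 and signals["macd_hist"] > 0
    # The LLM's role is a veto: any negative assessment blocks the trade.
    qual_ok = assessment["sentiment"] != "negative"
    return "BUY" if quant_ok and qual_ok else "SKIP"
```

Note the ordering: the position-limit check runs before anything the LLM produced is even consulted.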

Layer 4: Execution (exchange API)

Clean API calls to the exchange. Order placement, position tracking, stop-loss enforcement. Deterministic, logged, auditable.
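To illustrate the shape of this layer, here is a hypothetical helper that pairs every entry with its hard stop level before anything is sent to the exchange. The schema and the 2% default are ours for the sketch, not Binance's:

```python
def build_order(symbol: str, side: str, qty: float, entry_price: float,
                stop_pct: float = 0.02) -> dict:
    """Build a market order plus its hard stop level (illustrative schema)."""
    if side not in ("BUY", "SELL"):
        raise ValueError(f"invalid side: {side}")
    # Stop sits below entry for longs, above entry for shorts.
    if side == "BUY":
        stop = entry_price * (1 - stop_pct)
    else:
        stop = entry_price * (1 + stop_pct)
    return {"symbol": symbol, "side": side, "type": "MARKET",
            "quantity": qty, "stop_price": round(stop, 2)}
```

Computing the stop at order-construction time, rather than after the fill, is what makes "0 catastrophic losses" enforceable: no order exists without one.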

Lesson 2

The right role for an LLM in a trading system is context interpreter and risk narrator, not decision maker. The architecture where the LLM writes natural language risk assessments that feed into deterministic rules outperformed the architecture where the LLM made the final call.

What Actually Worked

- +34% improvement in signal accuracy after the V2 rebuild
- ~12 average weekly trades (V2 is more selective than V1)
- 0 catastrophic losses since adding hard stop-loss rules

The news context aggregation was a genuine improvement. The agent is better at flagging "this feels like a high-uncertainty macro period, reduce position sizes" than any indicator-only system we've used. That qualitative risk awareness, translated into a parameter adjustment rather than a direct trade decision, turned out to be the most valuable part of the LLM integration.

Sentiment monitoring also works well. Tracking relevant Telegram channels, Twitter signals, and news aggregators for tone shifts around specific assets โ€” summarized by the agent, scored by a classifier, fed into position sizing โ€” meaningfully improved timing.
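A sketch of how both of those qualitative reads feed position sizing rather than trade decisions; the scaling constants are illustrative:

```python
def size_multiplier(sentiment_score: float, macro_risk: str) -> float:
    """Map qualitative reads to a position-size scalar, never a trade call."""
    base = 1.0
    if macro_risk == "high":
        base *= 0.5  # agent-flagged uncertainty halves the base size
    # Clamp sentiment to [-1, 1], then scale size within [0.5x, 1.5x] of base.
    s = max(-1.0, min(1.0, sentiment_score))
    return round(base * (1.0 + 0.5 * s), 3)
```

The point of the clamp is the same as everywhere else in the system: the qualitative layer can only nudge within bounds, never push sizing to an extreme.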

What Failed (Be Honest)

A few things didn't work despite sounding good. The most instructive failure was letting the agent tune its own decision parameters.

Lesson 3

An AI agent that can modify its own decision parameters without external constraints will optimize for the wrong objective. "Reduce losses" sounds like what you want until you realize the agent's answer is "never trade." Hard external constraints are not optional.
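One way to make those constraints concrete: every parameter the agent proposes gets clamped against bounds it cannot edit. The keys and bounds below are hypothetical:

```python
# Fixed external limits: (lower, upper); None means unbounded on that side.
LIMITS = {
    "position_size_pct": (0.5, 5.0),
    "stop_loss_pct": (1.0, 3.0),
    "min_trades_per_week": (3, None),  # "never trade" is not an option
}

def clamp_params(proposed: dict) -> dict:
    """Return a safe copy of proposed parameters, clamped to LIMITS."""
    safe = {}
    for key, (lo, hi) in LIMITS.items():
        value = proposed.get(key, lo)
        if lo is not None:
            value = max(lo, value)
        if hi is not None:
            value = min(hi, value)
        safe[key] = value
    return safe
```

The floor on trades per week is the direct countermeasure to the "never trade" failure mode: the agent can still be conservative, but not infinitely so.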

The Infrastructure Stack

The agent runs 24/7 on a VPS, a non-negotiable for any trading system. Cloud functions with timeout limits are incompatible with a system that needs persistent websocket connections to exchange APIs and immediate response to market events.

- Binance API: primary exchange integration. Real-time data, order execution, account management. Robust API with a good Python SDK.
- VPS hosting (Netcup): always-on hosting for the agent. Persistent websocket connections, no timeout limits, everything logged to disk.

Monitoring is critical

A trading agent that goes silent at 2am is not a minor inconvenience; it's a position sitting unmonitored in a volatile market. Our monitoring setup: a heartbeat file written every 5 minutes, a watchdog that checks it, and a Telegram alert if it goes stale for 10 minutes. We also log every trade decision (and every skipped trade) with full reasoning to a Supabase table. When something weird happens, you can trace exactly what the agent saw and decided.
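The heartbeat half of that setup is small. A sketch, with an illustrative path and the thresholds from above:

```python
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/agent_heartbeat")  # illustrative path

def write_heartbeat() -> None:
    """Agent side: called from the main loop every 5 minutes."""
    HEARTBEAT.write_text(str(time.time()))

def is_stale(max_age_s: float = 600) -> bool:
    """Watchdog side: True if the heartbeat is missing, unreadable, or old."""
    try:
        last = float(HEARTBEAT.read_text())
    except (FileNotFoundError, ValueError):
        return True
    return time.time() - last > max_age_s
```

The watchdog runs as a separate process (e.g. a cron job), so a hang in the agent can't take the alerting down with it.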

```python
# Every decision gets logged, not just trades
await db.insert('agent_decisions', {
    'timestamp': datetime.utcnow(),
    'signals': signals_dict,
    'agent_assessment': qualitative_text,
    'decision': decision,        # one of 'SKIP' | 'BUY' | 'SELL'
    'reason': one_line_reason,
    'market_regime': regime,     # one of 'trending' | 'ranging' | 'volatile'
})
```

Where We Are Now

The V2 architecture is more conservative than V1. It trades less, which initially felt like failure. It turns out that trading less, in this particular domain, is a feature, not a bug: selectivity outperforms frequency for most retail-scale automated systems.

We're working on V3, which adds proper backtesting infrastructure. One of the lessons we learned too late: you can't tell if a system is genuinely good or just lucky without running it against historical data at scale. We built the live system before building the backtester. Do the opposite.
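Even a minimal backtester clarifies why. A toy event loop, single asset, one unit, no fees or slippage (which are the first things a real version must add):

```python
def backtest(prices: list, decide) -> float:
    """Replay closes through a strategy; return final P&L in quote currency."""
    position = 0.0  # units held
    cash = 0.0
    for i in range(1, len(prices)):
        action = decide(prices[:i + 1])  # strategy sees history only, no lookahead
        if action == "BUY" and position == 0.0:
            position, cash = 1.0, cash - prices[i]
        elif action == "SELL" and position > 0.0:
            position, cash = 0.0, cash + prices[i]
    return cash + position * prices[-1]  # mark any open position to last close
```

Passing the strategy only the history up to the current bar is the whole game; lookahead leaks are the classic way a backtest flatters a system that's actually just lucky.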

The honest summary: AI agents can make a trading system meaningfully better at context processing and risk narration. They cannot substitute for sound quantitative foundations. The interesting architecture is one where each layer does what it's actually good at.

If you want to dig into the infrastructure side of running agents like this, the VPS hosting guide covers that in detail.