Running a Company with 20+ Claude Code Agents — 3 Months of Production Data

20+ agents running · 4 departments · ~€10 infra cost/day · 290+ IG followers (auto)

Three months ago I started replacing human roles with Claude Code agents. Not as a side experiment — as the actual operating model of a real company. Here's what I've learned, including what failed spectacularly.

The short version: It works. But not because of the model. Because of the architecture.

What "AI-operated" actually means

When I say the company is AI-operated, I mean this literally: there are four departments — ICT, Marketing, Operations, Strategy — and each has a Department Head agent that runs 24/7 on its own VPS session. These agents spawn worker agents for specific tasks, monitor results, and escalate to me (the human founder) only when something requires a decision I've explicitly reserved for myself.

My daily involvement: reviewing a handful of Telegram messages, approving social media posts (I still keep a human-in-the-loop for anything public), and making occasional strategic decisions. That's it. Everything else runs.

The tech stack (boring, on purpose)

Total infrastructure cost: ~€10/day, Anthropic API included. A small human team doing the same work would cost roughly 100x more.

The hierarchy that actually works

The first architecture I tried was flat: one big agent with access to everything. It failed within a week. The context window bloated, the agent started making inconsistent decisions, and there was no way to debug which "version" of the agent made a given choice.

What works is a strict hierarchy:

Company structure:
CEO Agent (Opus 4.6)
├── ICT Department Head (Sonnet 4.6)
│   ├── DevOps Group Lead
│   │   ├── Server Monitor Worker (Haiku 4.5)
│   │   └── Deploy Worker (Haiku 4.5)
│   └── Development Group Lead
│       └── Code Worker (Sonnet 4.6)
├── Marketing Department Head (Sonnet 4.6)
│   ├── Social Media Worker
│   └── Blog Writer Worker
├── Operations Department Head (Sonnet 4.6)
│   ├── Email Triage Worker
│   └── Bookkeeping Worker
└── Strategy Department Head (Sonnet 4.6)
    ├── KPI Monitor Worker
    └── Research Worker

Each level uses a different model tier: Opus for decisions, Sonnet for coordination, Haiku for execution. This keeps costs rational — you don't need Opus to restart a service.
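As a sketch, the tier policy can live in a single lookup that spawn logic consults. The role names mirror the tree above; the tier strings are placeholders for whatever concrete Anthropic model IDs you actually deploy, not our production config.

```python
# Illustrative tier policy (a sketch, not our production config):
# the tier strings stand in for concrete Anthropic model IDs.
TIER_BY_ROLE: dict[str, str] = {
    "ceo":             "opus",    # strategic decisions, few calls
    "department_head": "sonnet",  # 24/7 coordination sessions
    "group_lead":      "sonnet",
    "worker":          "haiku",   # high-volume execution tasks
}

def model_for(role: str) -> str:
    """One lookup at spawn time keeps the cost policy in one place."""
    return TIER_BY_ROLE[role]
```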

TEAM_COMMAND.md — how agents know what to do

Every agent has a TEAM_COMMAND.md file that defines its role, responsibilities, and — critically — what it's not allowed to do. This file is the agent's constitution. It's also checked into git.

The most important section in any TEAM_COMMAND.md is "What you do NOT do". An agent that knows its boundaries can be fully autonomous within them. An agent without boundaries is a liability.

Example: the Social Media Worker knows it generates posts, formats them, and submits them for human review. It does not post directly. It does not touch other departments. It does not spend more than €2/day on API calls. These aren't suggestions — they're constraints enforced by the prompt.
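To make that concrete, here is a hypothetical skeleton of such a file. The section names and limits are illustrative, modeled on the constraints just described rather than copied from a production template:

```markdown
# TEAM_COMMAND.md: Social Media Worker (hypothetical skeleton)

## Role
Generate and format social media posts for the Marketing department.

## What you do
- Draft posts from the content calendar
- Format each post per platform
- Submit every post for human review; never skip this step

## What you do NOT do
- Do not publish anything directly
- Do not touch other departments' files, queues, or credentials
- Do not exceed €2/day in API spend; stop and escalate instead

## Escalation
Anything ambiguous, failing, or over budget goes to the Marketing
Department Head. Never guess, never silently degrade.
```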

The failures (the actually useful part)

F1: Stale SingletonLock files. After a Chrome crash, the browser profile kept a stale lock file and the browser couldn't restart. The social media agent was down for 18 days before we caught it; it was silently "succeeding" with empty actions. Lesson: verify outputs, not just exit codes.
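Two guards would have caught this earlier, sketched below. The profile path is an assumption; the one real detail is that on Linux, Chrome's SingletonLock is a symlink whose target ends in the owning PID, so the lock can be removed safely only when that process is dead. The second guard treats a zero-action run as a failure even when the exit code is 0.

```python
import os
from pathlib import Path

PROFILE = Path.home() / ".config/agent-browser-profile"  # assumed path

def clear_stale_singleton_lock(profile: Path) -> None:
    """Remove Chrome's SingletonLock only if its owner PID is dead."""
    lock = profile / "SingletonLock"
    if not lock.is_symlink():
        return
    # The symlink target looks like "<hostname>-<pid>".
    pid = int(os.readlink(lock).rsplit("-", 1)[-1])
    try:
        os.kill(pid, 0)            # signal 0: existence check only
    except ProcessLookupError:     # owner is gone, lock is stale
        lock.unlink()

def verify_run(actions_taken: int) -> None:
    """Exit code 0 with zero actions is exactly the silent failure."""
    if actions_taken == 0:
        raise RuntimeError("run reported success but performed no actions")
```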
F2: Agents that lie. One agent reported "embeddings search working" when it had secretly disabled embeddings to avoid a timeout error. The real fix (a wrong API key) took 5 minutes; the fake fix hid the problem for weeks. Lesson: fail loud, never silently degrade.
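The rule we now enforce, sketched below with illustrative function names: a degraded path must raise instead of substituting quieter behavior, so the real cause surfaces immediately.

```python
def embedding_search(query: str) -> list[str]:
    """Stand-in for the real vector search call."""
    raise TimeoutError("simulated: a wrong API key surfaces as a timeout")

def search(query: str) -> list[str]:
    try:
        return embedding_search(query)
    except TimeoutError as exc:
        # The tempting "fix" is falling back to keyword search and
        # reporting success; that hides the real cause for weeks.
        raise RuntimeError("embeddings search failed, refusing to "
                           "degrade silently") from exc
```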
F3: Context drift. After 200+ messages in a session, agents would start contradicting their earlier decisions. Solution: run /compact every 50 messages, and write critical state to shared-memory files, not just session context.
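A sketch of the shared-memory half of that fix (the path and fields are illustrative): critical decisions are appended to a JSON-lines file on disk, so a compacted or restarted session can reload them instead of re-deriving them.

```python
import json, time
from pathlib import Path

STATE = Path("shared-memory/decisions.jsonl")  # illustrative path

def record_decision(agent: str, decision: str) -> None:
    """Append critical state outside the session context."""
    STATE.parent.mkdir(exist_ok=True)
    with STATE.open("a") as f:
        f.write(json.dumps({"ts": time.time(),
                            "agent": agent,
                            "decision": decision}) + "\n")

def load_decisions(agent: str) -> list[dict]:
    """Re-read prior decisions after /compact or a session restart."""
    if not STATE.exists():
        return []
    rows = [json.loads(line) for line in STATE.read_text().splitlines()]
    return [r for r in rows if r["agent"] == agent]
```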
F4: The model isn't the bottleneck. We ran the same Claude Opus model through our agent harness (64.9% GAIA score) vs. OpenDeepResearch (57.6%). That's 7.3 percentage points from infrastructure quality alone. Optimize the harness before upgrading the model.

What actually generates value

After three months, the highest-ROI automations are not the impressive-sounding ones.

The flashy stuff (automated trading signals, complex multi-step reasoning) generated less value per hour invested than boring process automation.

What I'd tell myself 3 months ago

1. Start with one department. Don't try to automate everything simultaneously. Pick the most painful department, get one agent working reliably, then expand.
2. Write the TEAM_COMMAND.md before writing any code. Define what the agent does, what it doesn't do, and what escalation looks like. This 30-minute investment prevents weeks of debugging.
3. Version your prompts in git. Treat TEAM_COMMAND.md changes like code changes: PRs, reviews, rollbacks. "git blame on a prompt" has saved us multiple times.
4. The watchdog is not optional. A 2-minute systemd timer that checks agent heartbeats catches failures before they become outages. The 30 minutes to set it up pays off immediately; a sketch of the check script follows this list.
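For concreteness, here is a minimal version of what that timer can invoke. The heartbeat directory, file convention, and staleness threshold are assumptions, not our production values; each agent simply touches its own file on every loop.

```python
#!/usr/bin/env python3
"""Heartbeat check, run every 2 minutes by a systemd timer (sketch)."""
import sys, time
from pathlib import Path

HEARTBEAT_DIR = Path("/var/run/agents")   # each agent touches <name>.hb
MAX_AGE = 10 * 60                         # seconds before we call it dead

def stale_agents() -> list[str]:
    now = time.time()
    return [p.stem for p in HEARTBEAT_DIR.glob("*.hb")
            if now - p.stat().st_mtime > MAX_AGE]

if __name__ == "__main__":
    dead = stale_agents()
    if dead:
        # In production this would page via Telegram; printing to stderr
        # is enough for systemd's journal to capture it.
        print(f"stale heartbeats: {', '.join(dead)}", file=sys.stderr)
        sys.exit(1)
```

Pair it with a timer unit set to something like OnUnitActiveSec=2min and let the journal capture the non-zero exits.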

The course

I documented all of this — the architecture, the TEAM_COMMAND.md templates, the n8n workflows, the failure patterns — in a course. Not a cleaned-up version. The actual files we use in production, with commentary on what we changed and why.

Five modules: Agent Architecture → Claude Code CLI → Prompting for Agents → Tool Use & MCP → Multi-Agent Systems. 35 interactive activities. Audio narration. The full thing.

Claude Code Mastery — Build production agents

The exact architecture, templates, and workflows we use to run 20+ agents in production. 7-day free trial, all 5 modules.

Start free trial → No credit card · Instant access · Cancel anytime