I Swear I’m Not Advertising For Anthropic

I've had several folks ask what my AI dev workflow looks like, so I thought I'd share a bit about how I use it, what I've built to support it, and what I've learned. For context, I spend more on AI tokens each week than some people spend on rent, and I typically run multiple agents doing different tasks 24/7. It's fair to say I use a lot of AI right now.

What I'll describe is my workflow and what works for me. But I honestly think everyone will use AI best in their own way, and no single workflow will work for everyone. I wouldn't be shocked if a year from now, everyone has their own variations of a lot of these tools. That's probably how it should be.

We're a Small Team. The Ambitions Are Not Small.

At Lightspark, we're building the infrastructure for open, interoperable payments — real-time, global, built on Bitcoin, Lightning, and Stables. The kind of thing where each week really does matter, each bug really does matter, and the window to establish ourselves before the competition arrives is finite and closing.

We truly only get one shot at a lot of this. A small team with the right leverage can move faster than organizations ten times our size — but only if we're relentless about how we spend our time. When AI coding tools started getting genuinely useful, my gut was that this wasn't a "nice to have." It was the single biggest force multiplier available to a team like ours.

One Agent Is a Bottleneck

A single Claude or Codex session is useful, but it's also slow. You prompt, you wait, you review, you prompt again. It's a linear workflow in an era where the bottleneck is human attention, not compute. You're just sitting there while the model thinks, and that idle time is killing your throughput.

The real unlock is parallelism. For independent features, I use separate worktrees, and that works great — full isolation, minimal conflicts. But sometimes you want multiple agents working on the same medium to large feature, and the moment you try that, things can start going sideways. Agents overwrite each other's changes. They conflict on files. They make contradictory decisions about the same interface. It's like hiring ten contractors, giving them all keys to the same house, and then going to lunch.

I needed a foreman. So I built one.

Enter Hurlicane

Hurlicane (formerly "Hurliwind"; yes, naming is hard, as evidenced by clawdbot > moltbot > openclaw) is an orchestrator for running many Claude Code and Codex sessions in parallel. It's a full web UI where I can see all running agents as real-time cards, stream their terminal output live, view diffs, manage job queues, and, crucially, keep agents from trampling each other's work.

Some of you might be thinking, "isn't this just Gastown?" Fair question. I'm sure Gastown works well for a lot of folks, but it just wasn't my vibe. I didn't want to spend a day learning all their terms. Mayors? Polecats? Refineries? Rigs? I just wanted to ship code faster.

The philosophical difference is real, though. Gastown is agent-driven. An AI "Mayor" decomposes your work autonomously. Hurlicane keeps you in the driver's seat. You design the workflows, trigger the debates, and decide what runs when. I think there's a strong argument that humans should still be orchestrating the work and AI should be executing it.

Here's a quick rundown of what Hurlicane has that Gastown doesn't:

  • Full web UI with real-time agent cards, live terminal streaming, and a diff viewer
  • Structured multi-round debates where Claude and Codex argue with consensus detection, post-debate actions, and verification loops
  • Model auto-classification that routes tasks to the right model automatically
  • Batch templates for running parameterized job sets across many agents
  • Built-in cost tracking dashboard (trust me, you need this at $3K/week)
  • File lock system with blocking acquisition and DFS-based deadlock detection
  • Knowledge base that persists learnings across jobs and consolidates them automatically
  • Interactive sessions where you type directly into a running agent's terminal

What Gastown has that we don't: agent-driven autonomous decomposition, git-backed state with no database dependency, persistent agent identity across repos, cross-repo coordination, and a built-in merge queue. Different tools for different brains. Use what works for you.

How I Actually Build Features

I want to walk through this in detail because I think the specific patterns matter more than the tool itself. Anyone can build an orchestrator. The hard part is figuring out the workflow that actually produces good code.

1. Start with the Design, Not the Code

I start every project by iteratively working on the product scope, going back and forth with Claude or Codex. You will rarely one-shot large features; Claude is extremely lazy and will do the minimal work without thinking about edge cases. Your job is to curb that laziness and force it to think harder.

If you just input one prompt and tell Claude to build something, you're going to have a lot of problems. Talk to Claude like you're brainstorming a design with a human. Ask about how certain edge cases are covered. Think deeply yourself about the design and probe in the areas you would probe if a junior engineer came to you and wanted input on their architecture. What happens if this service dies? What if the upstream API changes its contract? What about the race condition between these two writes?

Once you get to a place where you feel like the bones of a design are solid, tell Claude to document the design in a markdown file. In Hurlicane, just check the "use worktree" box when you create a new job, and the work lands in an isolated git worktree automatically. Clean separation. No merge conflicts with your mainline work.

2. Let Claude and Codex Argue

This is probably my favorite feature and the one I think has the most potential to change how people think about AI-assisted development.

Hurlicane has a structured debate system where Claude and Codex argue with each other about whether a design or implementation is complete and correct. Neither model individually finds all the issues with code, but combined, they do a pretty solid job. They each catch things the other misses. It's like having two senior engineers who've never worked together review each other's thinking. The disagreements are where the gold is.

Under the hood, it's a state machine: Claude evaluates, Codex evaluates, they compare verdicts. If they agree, consensus detection checks both semantic similarity and explicit agreement — they move on. If they disagree, the debate advances to the next round with the full context of their disagreement. You can set how many loops to run (say 3), and they'll go through the full cycle: debate the issue, make the fixes, validate the fixes, then start over. They keep going until both models agree that everything is solid, or they exhaust their rounds.
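The loop above can be sketched in a few lines. This is a minimal illustration, not Hurlicane's actual implementation: `evaluate_claude` and `evaluate_codex` stand in for the real model calls, and the verdict dictionaries are a made-up shape.

```python
# Minimal sketch of a two-model debate loop with consensus detection.
# The evaluator callables are stand-ins for real Claude/Codex calls.

def run_debate(artifact, max_rounds=3, evaluate_claude=None, evaluate_codex=None):
    """Alternate evaluations until both models approve or rounds run out."""
    history = []
    for round_no in range(1, max_rounds + 1):
        claude_verdict = evaluate_claude(artifact, history)
        codex_verdict = evaluate_codex(artifact, history)
        if claude_verdict["approve"] and codex_verdict["approve"]:
            return {"consensus": True, "rounds": round_no}
        # No consensus: carry the disagreement forward so the next
        # round runs with the full context of both verdicts.
        history.append({"round": round_no,
                        "claude": claude_verdict,
                        "codex": codex_verdict})
    return {"consensus": False, "rounds": max_rounds, "history": history}
```

The real system adds semantic-similarity checks on top of the explicit approve/reject verdicts, plus the fix-and-validate steps between rounds; the skeleton is the same.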

The models also sometimes flag issues that aren't real. So in my post-debate follow-up, I ask them to write a test that first proves the issue exists, then fix it, then prove the test now passes. Evidence before assertions, always.

I'll debate the design first until both models are satisfied. Then I move to implementation.

3. Build Incrementally, Test Relentlessly

I have Claude implement things test-driven, building portions of the system one flow at a time, and I include the specific things I want Claude to test at each step.

After each phase, I run what I call batches. These are stored sets of checks, one per agent, so no single context window gets overloaded trying to do too much. For example, my code quality batch includes checks like:

  • "Check for over-engineered areas"
  • "Check for DRY violations"
  • "Code comments should describe why, not how"
  • "Look for missing error handling at system boundaries"
  • "Verify test coverage on the happy path and the two most likely failure modes"

Each check runs as its own agent. I can run the whole batch as a debate too, just check a box.

Then I kick off another Claude-vs-Codex debate on the full implementation. I'll set loops to maybe 3, and they iterate: debate the code, make fixes, validate the fixes, start over. They keep going for however many loops I set, unless they both agree that everything is solid.

I'll keep doing this flow until I've implemented everything for the feature. For myself, I've found it more helpful to build everything out and then work on manual testing. I'll generally ask Claude to write up a local tool I can use to test the feature end-to-end. Even with a lot of automated testing and back-and-forth along the way, there are often still issues that only surface with real manual testing. If you solely develop one PR at a time on a big feature, you'll miss these things.

4. Parallelize Aggressively

In the AI world, you're constantly waiting for things. How quickly you move depends on how well you're able to context-switch. I always have a queue of things I want to do next so that I'm never left sitting idle, and I keep a running list of notes for what's coming.

Hurlicane supports job dependencies, so I can enqueue tasks that need to happen serially and they'll only execute when their parent task finishes. Meanwhile, independent work runs in parallel across other agents. The job lineage panel gives me a DAG view of the whole dependency graph, so I can see exactly what's blocked on what.
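Releasing jobs only when their parents finish is a topological ordering over the dependency DAG. Here's a minimal sketch using Kahn's algorithm; the job names are hypothetical and the real scheduler runs ready jobs concurrently rather than returning a flat list:

```python
# Sketch: release jobs only once all their parents have finished.
from collections import deque

def run_order(deps):
    """deps maps job -> list of parent jobs it waits on. Returns an order
    in which every job appears after all of its parents."""
    pending = {job: set(parents) for job, parents in deps.items()}
    children = {}
    for job, parents in deps.items():
        for parent in parents:
            children.setdefault(parent, []).append(job)
    ready = deque(job for job, parents in pending.items() if not parents)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in children.get(job, []):
            pending[child].discard(job)
            if not pending[child]:          # last parent just finished
                ready.append(child)
    return order
```

Everything sitting in `ready` at the same time is exactly the independent work that can fan out across agents in parallel.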

A big problem with parallelizing work is that agents WILL trample each other and overwrite another agent's changes. To solve this, Hurlicane has a file lock system built as a Claude Code PreToolUse hook, so only one agent can mutate a file at a time. Before any agent writes to a file, the hook checks that the agent holds the lock. If it doesn't, the write is blocked, the agent queues up, and it automatically resumes when the lock becomes available. There's even DFS-based cycle detection in the wait-for graph to prevent deadlocks. This is actually a pretty hard problem, but it's the kind of problem you absolutely have to solve if you want parallel agents on a shared codebase to work at all.
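The deadlock check itself is a standard cycle search over the wait-for graph. A minimal sketch, with illustrative agent IDs (an edge `a -> b` means agent `a` is waiting on a lock held by agent `b`):

```python
# DFS-based cycle detection over an agent wait-for graph.
# A cycle means a deadlock: some set of agents all wait on each other.

def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on current path / done
    color = {node: WHITE for node in wait_for}

    def dfs(node):
        color[node] = GRAY
        for nxt in wait_for.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:    # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and dfs(node) for node in wait_for)
```

In practice you'd run this check before granting a blocking lock acquisition, and refuse (or break) the wait that would close the cycle.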

5. Automate the Annoying Stuff

You can automate away the annoying parts of your job. I have templates where I can just press a button and it will:

  1. Check all of my existing PRs for comments that I haven't seen yet, group them by PR, think through whether each comment is valid, and recommend what we should do to resolve it
  2. Look at the CI/CD of all my open PRs and kick off agents to fix any failures
  3. Run any of these on a recurring schedule — I can have it automatically run every hour, for example

Which brings me to my favorite observation from this whole journey: engineering managers are now code review sub-agents. Congratulations on the promotion. Or is it a demotion? I'll let you debate that — in Hurlicane, of course, with consensus detection and a 3-loop verification cycle.

(I'm kidding. Mostly. The agents do a VERY thorough job, though — they've caught things in review that humans missed on the first pass. More than once.)

The Nerdy Bits

For the folks who want the technical details — and I know you do — here's how the system actually works.

Agent Spawning & Crash Recovery: Each agent runs as a detached subprocess (claude --print --output-format stream-json) that becomes a process group leader — meaning it survives server restarts. We auto-detect Python virtual environments, set process priority with nice -n 10, and pipe output to .ndjson files. If the server crashes and comes back, it re-attaches to the mid-stream agent output. Interactive agents get full tmux + PTY sessions with bidirectional terminal access from the web UI — you can literally type into a running agent's terminal from your browser.
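The detached-subprocess trick is easy to replicate. A rough sketch, assuming a POSIX system: the `claude` flags are the ones from above, but the function name and log handling are illustrative, not Hurlicane's code.

```python
# Sketch of spawning a detached agent process that outlives the server.
import os
import subprocess

def spawn_agent(cmd, log_path, niceness=10):
    """Launch cmd in its own session at lowered priority, streaming
    stdout/stderr into an append-only NDJSON log file."""
    log = open(log_path, "ab")
    return subprocess.Popen(
        cmd,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,                  # new session: survives server restart
        preexec_fn=lambda: os.nice(niceness),    # mirrors `nice -n 10`
    )

# Example invocation (illustrative path):
# spawn_agent(["claude", "--print", "--output-format", "stream-json"],
#             "logs/agent-42.ndjson")
```

Because output lands in a plain file rather than a pipe held by the server, a restarted server can simply re-open the `.ndjson` file and resume tailing mid-stream.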

Smart Model Routing: A lightweight Haiku classifier evaluates each task's complexity in real-time and routes it to the appropriate model. Simple tasks go to Haiku, medium to Sonnet, complex to Opus. My gut is that most people are wasting money sending trivial tasks to Opus. You can override this per-job, but the auto-routing saves real money and keeps the fast stuff fast. The downside is an extra API call for classification, but it pays for itself many times over.
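A routing layer along these lines might look like the sketch below. `classify` stands in for the lightweight Haiku call, and the tier names and model labels are assumptions for illustration:

```python
# Sketch of complexity-based model routing with a per-job override.
# Tier names and model labels are illustrative, not Hurlicane's actual config.

ROUTES = {
    "simple": "claude-haiku",
    "medium": "claude-sonnet",
    "complex": "claude-opus",
}

def route_task(task, classify, override=None):
    """Pick a model for a task; an explicit per-job override always wins."""
    if override:
        return override
    tier = classify(task)  # expected: "simple" | "medium" | "complex"
    # Unknown tiers fall back to the middle of the road.
    return ROUTES.get(tier, ROUTES["medium"])
```

The design trade-off is exactly the one mentioned above: one extra cheap classification call per job, in exchange for never paying Opus prices for a one-line fix.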

Knowledge Base with Contradiction Detection: Agents call search_kb at task start and report_learnings when they're done. Entries are stored in an FTS5 SQLite virtual table for full-text search. A consolidation cycle runs every 6 hours to prune stale entries, deduplicate, and (this is my favorite part) detect contradictions between learnings. If one agent learned "X is broken" and another learned "X works fine," the consolidator flags it and resolves it. Knowledge that just accumulates without curation becomes noise. We actively prevent that.
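The FTS5 piece is small to replicate. A toy version, with a made-up schema (the real table layout and tool names will differ):

```python
# Toy FTS5-backed knowledge base: full-text search over agent learnings.
import sqlite3

def make_kb():
    db = sqlite3.connect(":memory:")
    # FTS5 virtual table: every column is full-text indexed.
    db.execute("CREATE VIRTUAL TABLE kb USING fts5(topic, learning)")
    return db

def report_learning(db, topic, learning):
    db.execute("INSERT INTO kb VALUES (?, ?)", (topic, learning))

def search_kb(db, query):
    # MATCH searches all indexed columns; rank orders by relevance.
    return list(db.execute(
        "SELECT topic, learning FROM kb WHERE kb MATCH ? ORDER BY rank",
        (query,)))
```

The consolidation and contradiction-detection pass would run on top of a store like this, comparing entries that match the same queries and flagging conflicting verdicts.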

MCP Integration: Hurlicane exposes 18+ tools through a Model Context Protocol server: file locking, job creation, shared notes for inter-agent communication, knowledge base, and integrations with Linear, OpenSearch, and Postgres for queries. Each agent gets its own MCP session, and orphaned waits (where an agent disconnects while waiting for a dependency) auto-recover when the dependency completes.

The Eye: This is our experimental continuous improvement agent. It runs in a cycle (orient, discover, analyze, verify, propose, execute, review, record), finding potential improvements in the codebase autonomously. The key insight: every finding gets independently verified by a second model before it becomes a proposal. Codex verifies Claude's findings and vice versa. Confidence scoring is weighted 60/40 between discoverer and verifier. Nothing gets executed without human approval. It's like having an engineer who's always doing a code audit in the background, except this one never gets tired and never gets distracted by Slack.
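The 60/40 weighting is just a convex combination of the two models' scores; the function name here is invented for illustration:

```python
# Sketch of the 60/40 discoverer/verifier confidence weighting.

def combined_confidence(discoverer_score, verifier_score,
                        w_discoverer=0.6, w_verifier=0.4):
    """Weighted confidence for a finding; both scores assumed in [0, 1]."""
    return w_discoverer * discoverer_score + w_verifier * verifier_score
```

Weighting the discoverer more heavily rewards the model that actually found the issue, while the verifier's independent pass still pulls obviously shaky findings below any execution threshold.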

Where Humans Still Matter Most

I want to be honest about where this breaks down, because the hype around AI-assisted development glosses over the parts that actually matter.

Payment flows and financial logic still get manual review from senior engineers, full stop. When you're moving real money, especially in our world of Bitcoin, Lightning, and Stables, a subtle bug isn't just a bad UX. It's someone's money disappearing. I can only imagine how a partner would feel if they discovered a rounding error in fee calculations that had been silently eating their margin for weeks. We're talking about systems where off-by-one errors can mean off-by-a-million-dollars. The agents are great at scaffolding, and the debates catch a lot — but a human who deeply understands the payment domain reviews every line that touches money movement, fee calculation, or cryptographic operations.

Crypto and security-sensitive code gets the same treatment. Key management, signature verification, consensus logic. These are areas where "the tests pass" is necessary but nowhere near sufficient. We pair AI speed with human paranoia, and I'd encourage anyone building in this space to do the same.

The pattern we've landed on: let agents build fast, debate each other, catch 90% of issues, then put experienced human eyes on the 10% that has the highest blast radius. It's not about replacing review. It's about focusing human attention where it actually matters.

It's Never Been a More Exciting Time to Build

I've been writing software for a long time. I have never, not once, been this productive. The gap between "idea" and "shipped" has never been this small. The model companies built the rocket, and now we're strapping on boosters.

Seriously though, think about what your team could accomplish if every engineer had a dozen AI agents working in parallel, coordinated properly, with file locks preventing conflicts and structured debates catching bugs before they hit production. What velocity could you ship at?

We've open-sourced Hurlicane so you can try all of this yourself. Fork it, break it, make it better. If it helps you ship faster, that's a win.

We're using Hurlicane to build the future of open, interoperable payments at Lightspark. Real-time global settlement. The kind of infrastructure that makes money move like information. This is our chance to make our mark on how Bitcoin is used and to enable use cases that were never possible before.

Come build with us:

Kevin Hurley is the CTO of Lightspark, where he spends an unreasonable amount of money on AI tokens and has strong opinions about how many agents should argue with each other before code gets merged.