AI Turned Cold Email Into a Testing Machine (Here's How We Use It)

Watch the full video: AI Turned Cold Email Into a Testing Machine (Here's How We Use It)

Cold Email Testing Should Work Like Facebook Ads

Facebook published data years ago showing that the advertisers who generate the biggest ROAS are the ones doing the most testing. That finding has always stuck with me, because if there's one thing I've wanted to improve about running a cold email agency, it's the ability to test more, faster.

Last year we ran 1,500 campaigns, generated over 60,000 positive replies, and worked with hundreds of clients. When we dumped all of that data from our email sending accounts into a Supabase database and pointed Claude Code at it, the patterns became obvious: the clients who ran more campaigns with us -- the ones who tested more segments, more messaging, more offers -- consistently outperformed the ones who went all-in on a single campaign.

The problem is that programmatic testing in cold email has always had bottlenecks that Facebook ads don't have.

Why Cold Email Testing Has Been Harder Than Paid Ads

With Facebook ads, the audience is largely selected for you by the algorithm. You set some filters, upload creative, and let the platform optimize. Cold email is different:

You need to build the audience yourself for every test
You need to route each contact to the right campaign based on their attributes (title, industry, headcount, etc.)
You need to create the campaigns in Smartlead or Instantly with the right copy and settings
You need to make sure nobody ends up in two different campaigns by accident

Running 5+ tests simultaneously used to mean hours of manual work in Clay tables: building all the formulas, uploading campaigns, being extremely militant about which formula maps to which campaign. It was a mess.

How Claude Code Changes the Math

Claude Code (along with Cursor and Codex) removes the two biggest bottlenecks:

Content generation and campaign ideation
Orchestration -- the if/then logic of routing contacts to campaigns

Here's how we're using it.

Three Claude Skills That Run Our Testing Framework

We've built three distinct skills in Claude that work together:

Skill 1: Campaign Strategy

This skill focuses on creating a balanced mix of campaign types:

Broad campaigns -- messaging that can go to every person in the ICP, no specific signal needed
Focused campaigns -- targeting a segment of the ICP based on employee headcount, job title, industry, or other differentiators
Niche campaigns -- built around a specific signal, like website visitor deanonymization or competitors' technology usage. Limited TAM, but high intent

The skill generates 10-20 campaign ideas per client, each with segmentation criteria, hooks, target audience, CTAs, and lead magnet suggestions.

Skill 2: Campaign Scoring

The second skill takes those ideas and scores them on likelihood of working.

We loaded all the campaigns we've ever run -- along with their analytics -- into Supabase and pointed about 30 Claude sub-agents at the data to extract rules about what makes a campaign excellent vs. average vs. below average. Those rules now power the scoring:

"Based on our past data and campaigns we've sent, we think this campaign can be improved with this idea, this idea, and this idea."

After scoring, the campaign ideas self-improve based on the feedback. The strategy skill adjusts its recommendations before anything gets written.

Skill 3: Copywriting

The third skill generates the actual email copy. It's trained on about 20 Google Docs of campaigns that performed well, plus content from cold email copywriters like Josh Braun whose thinking about messaging is consistently excellent.

The output matches the style and tone that has historically worked across our client base.

The Orchestration Problem (Solved)

The second half of the puzzle is getting the right contacts into the right campaigns based on their attributes. In plain terms: if this title AND this industry AND this employee headcount range, then go to this campaign.

This is where Claude Code shines. We give it:

A voice note describing how we want the campaign to go
The Google Doc with the approved copywriting and targeting
The client's ICP breakdown

Claude Code builds the routing plan, and we ask it to check for holes: "Are there any gaps in this plan we haven't thought about?" It surfaces edge cases we'd miss manually.

Hybrid Architecture: Clay + Trigger.dev/Railway

We're not replacing Clay for enrichment. Instead, we're building Trigger.dev or Railway worker endpoints that contain all the if/then routing logic. Those endpoints get called from within Clay tables, so we still get:

Clay's queuing and concurrency management
Claygent for advanced enrichment
Cloud-hosted reliability

We just dump the JSON into the Trigger.dev/Railway worker, and it handles the orchestration. Best of both worlds.

Why More Testing Is Better for Everything

Running more granular tests with smaller volumes per campaign is better across the board:

Deliverability -- different copy going out from each inbox means less repetitive content, which is better for inbox reputation
Faster learning -- cut losers early because the next test is already running
Lower risk -- if one campaign underperforms, it's a smaller volume hit than going all-in on a single campaign
Compounding insights -- after 5, 10, 15 campaigns, Claude can analyze the responses and surface which segments, messaging, and offers are working. Then it suggests the next test. The process becomes recursive at the client level

The Shift: Cold Email Becomes Programmatic

A year ago, the analogy was "cold email is like a private ads network" -- you're buying impressions in someone's inbox. Now the analogy goes further: the testing and iteration is becoming as programmatic as Meta and Google Ads.

The two bottlenecks -- content generation and orchestration -- are both solved by Claude Code. The speed of testing in 2026 is going to be radically different from what was possible before.

Key Takeaways

The highest-performing cold email clients are the ones who test the most -- same pattern Facebook found with advertisers
Cold email testing has historically been bottlenecked by audience building and campaign orchestration
Three Claude skills power the framework: campaign strategy (broad/focused/niche), campaign scoring (against historical data), and copywriting (trained on winning campaigns)
Claude Code handles the if/then routing logic that used to require hours of manual Clay table work
A hybrid architecture (Clay for enrichment + Trigger.dev/Railway for orchestration) gives the best of both worlds
More tests = better deliverability, faster learning, and recursive improvement at the client level