Building cold-email-1: What We Learned From Training a Small Model to Outperform the Big Ones

Robin Salimans

May 21, 2025

Today we release cold-email-1, a small language model fine-tuned specifically for writing outbound sales emails. It's part of a broader bet we're making at Utopian Labs: that highly focused, task-specific models can outperform general-purpose ones - not by being bigger, but by being better at a narrow job.

In this post, we’ll share how we trained the model, what went wrong along the way, how we measured success, and how we managed the trade-offs between competing goals like readability, relevance, and factual accuracy. We used a combination of synthetic data generation, filtering, and reinforcement learning with a few tweaks of our own.

If you’re into practical ML, RL, or just curious how you can get real performance out of smaller models, this might be useful.

Step 1: Data Generation and Filtering for Cold Start Fine-Tuning

Before we could do any reinforcement learning, we needed a solid starting point. For the initial cold start fine-tuning run, we generated a large amount of task-specific data, scored it across multiple dimensions, and only kept the best examples.

We selected a wide range of companies and people from public sources to simulate realistic outbound scenarios, then used our research + copywriting agent to generate over 300,000 cold emails. For privacy reasons, we then replaced any real names (and other potentially personally identifiable information) with fake data.
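The replacement step is conceptually simple: swap each real entity for a fake one of the same type, so the email stays coherent. A minimal sketch of that idea in Python, assuming the entity spans have already been extracted by an upstream step (not shown) and using the `faker` package to generate substitutes; this is an illustration, not our production pipeline:

```python
from faker import Faker

fake = Faker()

def anonymize_email(text: str, entities: dict[str, str]) -> str:
    """Replace known PII spans with fake values.

    `entities` maps a real span to its type, e.g. {"Jane Doe": "person",
    "Acme Corp": "company"}; here it is assumed to come from an upstream
    extraction / NER step, which is not shown.
    """
    generators = {"person": fake.name, "company": fake.company, "email": fake.email}
    for span, kind in entities.items():
        # str.replace swaps every occurrence, so repeated mentions of the
        # same name all map to the same fake value.
        text = text.replace(span, generators[kind]())
    return text
```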

To filter this down, we applied a set of automated checks:

The most important filter was relevance. We used an LLM as a judge, scoring each email against a 15-point checklist that included things like:

We took the top ~3% of emails across all metrics—about 10,000 samples—and used them to fine-tune our base model.
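As a rough illustration of that filtering step, here is a minimal sketch (function names, normalization, and thresholds are ours, not the production pipeline): score every generated email on several dimensions, combine the scores, and keep only the top few percent overall.

```python
import numpy as np

def filter_top_fraction(emails, scorers, keep_fraction=0.03):
    """Score each email on every dimension and keep the best overall.

    `scorers` is a dict of name -> callable returning a float in [0, 1];
    in practice some of these are deterministic checks and the relevance
    score comes from an LLM judge.
    """
    # Matrix of shape (n_emails, n_scorers).
    scores = np.array([[scorer(e) for scorer in scorers.values()] for e in emails])
    # Normalize each dimension so no single metric dominates, then average.
    normalized = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
    overall = normalized.mean(axis=1)
    # Keep the top `keep_fraction` of emails by combined score.
    cutoff = np.quantile(overall, 1.0 - keep_fraction)
    return [e for e, s in zip(emails, overall) if s >= cutoff]
```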

Step 2: Fine-Tuning and Reinforcement Learning

We started with a Qwen-3 4B base model: large enough to be capable, but small enough to run efficiently and fine-tune quickly. After filtering down to the 10k high-quality emails, we fine-tuned the base model on them to give it a solid understanding of good cold outreach.
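For concreteness, a bare-bones version of that supervised fine-tuning step might look like the sketch below. The model name, hyperparameters, and the `filtered_emails` list (prompt + email strings from Step 1) are illustrative assumptions, not our exact setup.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Base"  # illustrative; any small causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    # Each example is a full prompt + email string; standard causal LM
    # training uses the (shifted) tokens themselves as labels.
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

# `filtered_emails` is a stand-in for the ~10k samples kept in Step 1.
loader = DataLoader(filtered_emails, batch_size=4, shuffle=True, collate_fn=collate)

for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```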

That already got us decent results. But the real performance gains came from reinforcement learning.

We used a method inspired by DeepSeek’s Group Relative Policy Optimization (GRPO), where the model learns not from a single absolute score, but from comparisons between its own generations. The idea is: given a set of outputs, learn which ones are better and use that to improve.

DeepSeek showed in their now-famous R1 paper that GRPO works very well on tasks with a single correct, verifiable answer, like math or physics problems. Our model writes cold emails rather than solving math problems, so we didn't have the luxury of marking an answer as simply correct or incorrect. We therefore needed another way to measure the quality of the model's output.
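At its core, the group-relative idea is simple: sample several completions for the same prompt, score each one (using the reward functions described below), and turn those scores into advantages by comparing each completion to its group's mean. A simplified sketch of that advantage computation, not the full GRPO loss:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute GRPO-style advantages.

    `rewards` has shape (num_prompts, group_size): one scalar reward per
    completion, with all completions for the same prompt in one row. Each
    completion is compared to its own group rather than to an absolute target.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 1 prompt, 4 sampled emails scored 0.2, 0.5, 0.9, 0.4.
advantages = group_relative_advantages(torch.tensor([[0.2, 0.5, 0.9, 0.4]]))
# The 0.9 email gets a positive advantage, the 0.2 email a negative one;
# the policy gradient then pushes probability mass toward the better emails.
```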

We ended up creating multiple reward functions, some scored deterministically, some judged by LLMs, and some a combination of both:

At first, we tried optimizing for everything at once. That gave us improvements in surface-level metrics: emails were shorter, easier to read, and used proper CTAs. But relevance plateaued.
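As a sketch of what "optimizing for everything at once" means in practice, the individual rewards can be collapsed into one scalar per completion via a weighted sum. The component functions and weights below are illustrative, not the real ones:

```python
# Illustrative reward components; the real set and weights differ.
def reward_word_count(email: str, target: int = 90, tolerance: int = 40) -> float:
    """Deterministic: 1.0 near the target length, decaying outside it."""
    n = len(email.split())
    return max(0.0, 1.0 - abs(n - target) / tolerance)

def reward_whitespace(email: str) -> float:
    """Deterministic: reward a few short paragraphs separated by blank lines."""
    paragraphs = [p for p in email.split("\n\n") if p.strip()]
    return 1.0 if 2 <= len(paragraphs) <= 5 else 0.5

WEIGHTS = {"relevance": 0.4, "readability": 0.2, "word_count": 0.2, "cta": 0.2}

def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```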

So we changed tactics.

We ran a second reinforcement learning phase focused only on relevance. That worked: emails became significantly more targeted and thoughtful.

But we started seeing some hallucinations: the model would occasionally make things up to sound more relevant.

To fix that, we adjusted our training to always favor factual accuracy. This brought the relevance score down slightly (from ~0.75 to ~0.65), but made the outputs more trustworthy.
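One simple way to "always favor factual accuracy" is to let factuality gate the other rewards instead of being just another term in the sum. A hedged sketch of that idea, building on the combined reward above; our exact formulation may differ:

```python
def gated_reward(scores: dict[str, float]) -> float:
    """Scale the combined reward by a factual-accuracy score in [0, 1].

    An email that fabricates details loses most of its reward regardless of
    how relevant or readable it is, so the policy can no longer trade
    factuality for relevance.
    """
    base = combined_reward(scores)           # weighted sum from the earlier sketch
    factuality = scores["factual_accuracy"]  # assumed to come from an LLM judge
    return base * factuality
```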

In the final (third) RL run, we reintroduced the other metrics and aimed for balance, since pushing on relevance alone had reduced readability and instruction following and increased word count.

To our surprise, when we reintroduced the other rewards, they all recovered to their previous levels (and beyond), while relevance held at the level reached at the end of the relevance-focused run.

This gave us a model that was:

This multi-pass RL approach turned out to be the key to breaking out of local optima and actually improving the model across all dimensions.

Step 3: Fixing Weird RL Artifacts with One Last Fine-Tune

Reinforcement learning gave us strong improvements in relevance and overall quality, but it also introduced some quirks.

The model had learned to game certain reward metrics. For example, it figured out that phrases like "juggling X and Y" or "you must be eyeing X" helped boost readability and recipient focus scores. So it started using them everywhere. Every email was suddenly about juggling or eyeing something. It read well on paper, but didn’t sound natural at all.

To fix this, we generated a new dataset:

We then did a final supervised fine-tuning (SFT) run on this cleaned dataset. This last step helped bring the model back to a more natural style, while keeping all the gains from RL intact.
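The cleanup itself can be as simple as dropping generations that lean on the over-used templates before the final SFT pass. A minimal sketch, with the offending phrases hard-coded for illustration and `rl_generations` standing in for the post-RL outputs:

```python
import re

# Phrases the RL policy over-used to game readability / recipient-focus rewards.
HACKED_PATTERNS = [
    re.compile(r"\bjuggling\b.+\band\b", re.IGNORECASE),
    re.compile(r"\byou must be eyeing\b", re.IGNORECASE),
]

def is_natural(email: str) -> bool:
    """Reject emails that rely on the reward-hacked boilerplate."""
    return not any(p.search(email) for p in HACKED_PATTERNS)

# `rl_generations` is a stand-in for the post-RL outputs; the kept samples
# then feed the final supervised fine-tuning run.
cleaned_dataset = [e for e in rl_generations if is_natural(e)]
```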

This was also a good reminder that reward functions, no matter how thoughtful, can still be gamed. A bit of cleanup at the end went a long way.

Step 4: Evaluation and Transparency

Once we had a model we were happy with, the next question was: how do we actually know it's good?

We didn’t want to rely on vague impressions or cherry-picked examples, so we evaluated cold-email-1 the same way we trained it: by running it through a structured checklist of real-world criteria. For relevance, we reused the same LLM-based judge with the 15-point framework. For judging the quality of CTAs, we used LLM-as-a-judge as well. For everything else (readability, word count, whitespace, etc.), we used deterministic scoring.
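A stripped-down version of that evaluation harness looks a lot like the reward setup: deterministic checks run directly, while relevance and CTA quality come from an LLM call. A sketch, with the judge call and prospect fields left as stand-ins:

```python
def evaluate_email(email: str, prospect: dict, llm_judge) -> dict[str, float]:
    """Score one generated email on the evaluation checklist.

    `llm_judge(prompt) -> float in [0, 1]` is a stand-in for however the
    judging model is called; the deterministic metrics mirror training.
    """
    return {
        "relevance": llm_judge(
            f"Score this cold email against the 15-point relevance checklist "
            f"for prospect {prospect['name']} at {prospect['company']}:\n{email}"
        ),
        "cta_quality": llm_judge(f"Score the call to action in this email:\n{email}"),
        "word_count": reward_word_count(email),   # deterministic, from the earlier sketch
        "whitespace": reward_whitespace(email),   # deterministic, from the earlier sketch
    }
```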

At Utopian Labs, one of our top priorities is reliability at scale. In the world of cold email, inconsistent output quality is a common problem, especially with general-purpose LLMs. Users often get burned after just a few tries when the model produces one good email and nine bad ones. We believe that consistently high-quality output, even across thousands of variations, is one of the biggest barriers to broader adoption. Solving that is core to why we’re building hyper-specialized models in the first place.

To make sure we weren’t overfitting to our own taste, we:

We also wanted others to be able to validate the model’s performance, not just take our word for it. So we:

The goal is to be transparent, not just about what the model can do, but also where it might fail.

What’s Next

cold-email-1 is available now via API (on request) or as part of our R1-Copywriting agent that plugs directly into CRMs and workflow builders. Whether you're integrating it into your own stack or using it through our tools, the idea is the same: better emails, fewer generic blasts, and full control over the message.

We’re also releasing a public scoring endpoint, so you can evaluate your own cold emails using the same framework we trained on.

This is just the first in a series. We’re working on more hyper-specialized models - each trained for a very specific task, each small enough to run efficiently, and each evaluated against real-world performance (not just benchmarks). We think there’s a lot of value in models that are narrow but deep.

If you’re curious about trying it out, experimenting, or building something on top, we’d love to hear from you.
