Next Sender, Please: Queue Architecture and Traffic Shaping

First things first, we want to set the scene: Today’s email environment is now shaped by the new sets of rules outlined and enforced by two major mailbox providers, Google and Yahoo. Here’s a quick run-down of those guidelines, which were implemented in February in large part to reduce the amount of spam sent to their users. This also indicates a shift to more granular reputation monitoring: moving further away from IP reputation, which can be impacted by a multitude of things including other senders, mailbox providers are now more interested in domain-based reputation signals.

To be compliant with their rules, senders were on the hook for making some big changes, like using authentication (particularly SPF, DKIM, and DMARC). So, what changes are we seeing now that we’re a couple months into enforcement? Have Google and Yahoo solved the decades-long spam scourge?

Not quite yet, but we are starting to see some interesting shifts in the world of deliverability operations.   

The Evolution of Deliverability Operations 

Deliverability operations, or DelOps (just like DevOps), is a fancier way to describe the more technical practices of email message processing and deliverability. While marketers focus on how list management, segmentation, and content impact deliverability, deliverability operations engineers are more concerned with setting PTR records, FBL registrations, and optimizations like ensuring their MTAs open at most 25 simultaneous connections per IP address to Comcast.

If I’ve lost you, that’s the point—DelOps folks are highly specialized email nerds.

Over time, the art of deliverability operations has become more obfuscated as cloud email infrastructure services (like us, SocketLabs) have taken on much of the ownership for these processes. Google and Yahoo’s new requirements seemed to fall mostly on senders and didn’t appear to call for major operational changes. However, we are seeing a groundswell implying otherwise.

What does that swell look like? Great question. We’ll start by laying out the important details to know in general, and then what we are doing at SocketLabs to turn that information into action and improvements for our customers.

Mailbox Providers Have Introduced New, Domain-Based SMTP Codes 

We’ve started to see new SMTP error codes from Google. Many of these errors use a 4xx response code and indicate a temporary error, suggesting that the message should be retried later. It might not seem groundbreaking, but historically, Google’s mail servers were never ones to defer or delay the delivery of campaigns; questionable mail was simply sent to the spam folder.

Prior to this year, it was extremely rare for Google to outright reject mail coming from legitimate senders. It seemed like Google had an unlimited amount of resources and storage space for undesirable mail. If you saw Google deferrals or blocks on a campaign, it meant something was really screwed up, and you needed to fix it.

Signs point to this changing.

Here are some examples of the new errors from Google:

421-4.7.28 Gmail has detected an unusual rate of unsolicited mail originating
421-4.7.28 from your DKIM domain [ 36]. To protect our users from spam,
421-4.7.28 mail sent from your domain has been temporarily rate limited. For
421-4.7.28 more information, go to
421-4.7.28 to
421 4.7.28 review our Bulk Email Senders Guidelines. o17-20020a05620a22d100b00787bed7f946si6356584qki.685 – gsmtp

421-4.7.28 Gmail has detected an unusual rate of unsolicited mail originating
421-4.7.28 from your SPF domain [ 35]. To protect
421-4.7.28 our users from spam, mail sent from your domain has been temporarily
421-4.7.28 rate limited. For more information, go to
421-4.7.28 to
421 4.7.28 review our Bulk Email Senders Guidelines. a14-20020a05620a066e00b0078773da68dasi5722284qkh.654 – gsmtp

There are a few critical things to note within these error codes.  

  1. They both indicate a reputation issue with the domain(s) authenticating a message.
  2. Based on our experience so far, Google is returning these errors much more frequently.
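To act on these responses, an MTA first has to recognize them. The following is a minimal sketch, not Hurricane MTA code, of how a multi-line SMTP reply might be parsed to pull out the reply code, the enhanced status code, and whether the text names the DKIM or SPF domain. The `Deferral` class, the `parse_reply` function, and the abbreviated sample reply are all hypothetical names and data chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One reply line looks like "421-4.7.28 text" (continuation) or
# "421 4.7.28 text" (final line). The enhanced status code is optional.
_LINE = re.compile(
    r"^(?P<code>\d{3})[- ](?P<enh>\d\.\d{1,3}\.\d{1,3})?\s*(?P<text>.*)$"
)


@dataclass
class Deferral:
    code: int
    enhanced: Optional[str]
    text: str
    auth_signal: Optional[str]  # "DKIM", "SPF", or None if neither is named

    @property
    def is_transient(self) -> bool:
        # 4xx replies are temporary failures: retry later.
        return 400 <= self.code < 500


def parse_reply(lines: list[str]) -> Deferral:
    """Collapse a multi-line SMTP reply into a single Deferral record."""
    code, enhanced, parts = 0, None, []
    for line in lines:
        match = _LINE.match(line.strip())
        if not match:
            continue
        code = int(match["code"])
        enhanced = match["enh"] or enhanced
        parts.append(match["text"])
    text = " ".join(part for part in parts if part)
    signal = re.search(r"\b(DKIM|SPF) domain\b", text)
    return Deferral(code, enhanced, text, signal.group(1) if signal else None)


# Abbreviated, illustrative version of the Gmail reply quoted above.
SAMPLE_REPLY = [
    "421-4.7.28 Gmail has detected an unusual rate of unsolicited mail originating",
    "421-4.7.28 from your DKIM domain. To protect our users from spam,",
    "421 4.7.28 mail sent from your domain has been temporarily rate limited.",
]

reply = parse_reply(SAMPLE_REPLY)
print(reply.code, reply.enhanced, reply.auth_signal)  # 421 4.7.28 DKIM
```

Keying on the reply code plus the enhanced status code, rather than on the full human-readable text, should make a rule like this less brittle if the provider rewords the message.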

We’ve seen some campaigns encounter these errors and still perform quite well in terms of recipient engagement, as the mail is ultimately still being delivered, just with a delay. However, we do find that the length of the delay (aggressiveness of the throttle) strongly correlates with overall performance, and ignoring these early warning signals tends to lead to greater amounts of deferrals—and eventually, soft bounces—for the sender.

Google is not the only mailbox provider issuing new error codes when deferring the delivery of campaigns based on the reputation of the authenticating domains. Comcast has also implemented a new domain-based rate limitation SMTP response (RL000010) with a very detailed description on their postmaster site.

451 4.2.0 Throttled –

They note, “This rate limiting policy is based on historical volumes and quality of that domain’s volume. This should apply to both DKIM and SPF. Any systems affected by this rate limit will receive a 4xx message (temp-fail) during the SMTP transaction. This message is designed to instruct the sending server to try again at a later time to deliver its email.” 

Temporary Delivery Errors are Necessitating Lasting Changes 

Temporary errors, also known as deferrals, are not new. What makes these new errors somewhat different?   

Rather than indicating a problem with the sending IP address, these error messages indicate a problem with the domains authenticating the message.

Mailbox providers have traditionally found it easier to defer messages based on the sending IP address, a data point present in every SMTP connection and known before the transmission of the message body. However, with the requirement for authentication now in place, these providers can implement sophisticated, domain-based rate limiting that relies on authentication. This development impedes spammers, who can no longer bypass filters by failing to authenticate. Consequently, mailbox providers can now use domain-based reputation signals more reliably and securely.

This is starting to make more sense now, right? Since the Yahoogle policy changes are increasing the likelihood that a message will be authenticated, they are now better able to use sender reputation data during the message transfer process. This allows them to choose whether to accept a message upfront, rather than accepting it first and only considering the reputation when making placement decisions.

However, this creates a problem for ESPs sending mail through their mail transfer agent (MTA). Most MTAs build their message processing queues in groups based on the destination domain, not the From/Sender domain.

Think of this like a busy store with thousands of customers approaching the checkout lanes, but no one knows that the credit card processing system for Visa is down and all Visa cards are being rejected.

If the customer in front of you is paying with a Visa, you are going to be waiting while the cashier attempts to run their card a few times, eventually asking for a different form of payment. Had all the Visa card holders been redirected to their own checkout line, then all the other customers could have checked out without any delays caused by Visa card holders.
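In MTA terms, the checkout lane is a queue keyed only by destination domain. Here is a minimal sketch of that arrangement, assuming a simplified `Message` type and hypothetical helper names (`build_queues`, `messages_ahead`). It shows how a well-behaved sender ends up waiting behind a throttled sender's mail in the same destination queue.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Message:
    from_domain: str  # domain in the From address (the "card brand")
    rcpt: str         # recipient address


def destination_of(msg: Message) -> str:
    """Return the recipient's domain, which is the usual queue key."""
    return msg.rcpt.rsplit("@", 1)[1].lower()


def build_queues(messages):
    """Group messages into FIFO queues by destination domain only."""
    queues = defaultdict(deque)
    for msg in messages:
        queues[destination_of(msg)].append(msg)
    return queues


def messages_ahead(queue, from_domain: str) -> int:
    """Count queued messages that sit in front of from_domain's first message."""
    for position, msg in enumerate(queue):
        if msg.from_domain == from_domain:
            return position
    return len(queue)


# Placeholder domains: one throttled sender and one reputable sender.
inbound = [
    Message("throttled.example", "a@gmail.com"),
    Message("throttled.example", "b@gmail.com"),
    Message("throttled.example", "c@gmail.com"),
    Message("good.example", "d@gmail.com"),
    Message("good.example", "e@comcast.net"),
]

queues = build_queues(inbound)
# The reputable sender's Gmail message waits behind three throttled ones.
print(messages_ahead(queues["gmail.com"], "good.example"))  # 3
```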

While it seems to make sense to segment all shoppers by their credit card brand, it might be wasteful for the store to have a checkout lane dedicated only to Diners Club card holders.

We wondered, what if you just did the segmentation dynamically when Visa or any other vendor was having an issue? What if we created queues based on where the email was headed, but also dynamically segmented traffic that was being throttled and causing deferrals?

We tried it out.

Introducing Dynamic Sub-Queues 

There are multiple ways to manage different message queues within a single MTA. For instance, our Hurricane MTA supports ‘accounts,’ or virtual MTAs (vMTAs).

In our MTA, each account has its own entirely unique queue, organized by destination domain. For senders managing only a handful of accounts, this can be an effective and relatively simple solution to maintain unique delivery pipes for each tenant.

[Figure: Queue details]

But for senders with thousands of downstream customers, operating a unique account or virtual MTA for each becomes unwieldy and impractical. Each account runs its own queues, which eats up resources, and managing configurations across all of them becomes complex at scale. This is why many ESPs find themselves using a resource designed for a single tenant in a multi-tenant fashion.

We solved a lot of these problems in our Complex sender email product, in which we broke an account into multiple components. We can now separately manage subaccounts (tenants) and IP pools (vMTA accounts). This arrangement enhances scalability by reducing the overhead associated with maintaining a single subaccount per customer. It also increases flexibility by enabling dynamic IP pool mappings, allowing us to route messages for a single tenant through different vMTA accounts (IPs) based on criteria such as the subject line of the message.

Then we introduced a reporting and analysis toolset to work in tandem with this structure so it’s easy to get a bird’s-eye view of all activity while still being able to dive into the details of a single subaccount or IP pool.

These changes also inherently incentivized pooling multiple subaccounts into shared IP pools, where the message queues are ultimately constructed. However, we still encountered issues in the ‘checkout lanes’ when one sending domain was generating errors at Gmail or Comcast, thus slowing things down for all the other reputable customers. We couldn’t let this problem remain unsolved.

Enter “Dynamic Sub-queues.” This feature of our Hurricane MTA allows for an MTA account to be configured with a delivery rule that creates a separate queue for emails sent from a specific From Address domain when a particular error message is encountered. Recall the new domain-based deferrals we’ve observed at Gmail and Comcast? When our MTA detects such deferrals, our delivery rule promptly establishes a separate queue for emails from domains like [email protected], while emails from other domains continue to be sent as usual.

Here is a chart explaining how we would build a queue for a specific sending domain based on encountering Comcast’s new RL000010 error message.

This new logic is fairly straightforward and works as follows:

  1. When a new message arrives at the MTA, we first check to see if a Dynamic Sub-queue already exists within the destination queue for that sending domain. If one exists, the message is placed into that separate queue.
  2. If a Dynamic Sub-queue does not already exist, the message will be placed in the general queue for that destination.
  3. We will attempt delivery, and if we encounter an SMTP response that has a corresponding rule to create a Dynamic Sub-queue, we will move that message and all other messages in the queue with that From domain into a new Dynamic Sub-queue.
  4. We can then apply traditional backoff logic, where messages in the Dynamic Sub-queue can be paused in delivery attempts for a short period of time.
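The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Hurricane MTA's actual implementation: the class and method names are hypothetical, the rule patterns are guesses at how such responses might be matched (in particular, matching the literal string `RL000010` in the reply text is an assumption), and the 300-second pause mirrors the five-minute pause used in the results that follow.

```python
import re
from collections import deque
from dataclasses import dataclass, field

PAUSE_SECONDS = 300  # assumed backoff for a newly throttled sending domain

# Hypothetical delivery rules: responses that trigger a Dynamic Sub-queue.
SUBQUEUE_RULES = [
    re.compile(r"^421[- ]4\.7\.28\b"),  # Gmail domain-based rate limit
    re.compile(r"RL000010"),            # Comcast domain-based rate limit
]


@dataclass
class Message:
    from_domain: str
    rcpt_domain: str


@dataclass
class SubQueue:
    messages: deque = field(default_factory=deque)
    paused_until: float = 0.0


class DestinationQueue:
    """Queue for one destination domain, with per-From-domain sub-queues."""

    def __init__(self):
        self.general = deque()
        self.sub_queues = {}  # from_domain -> SubQueue

    def enqueue(self, msg: Message) -> None:
        # Steps 1 and 2: use the sub-queue if one exists, else the general queue.
        sub = self.sub_queues.get(msg.from_domain)
        if sub is not None:
            sub.messages.append(msg)
        else:
            self.general.append(msg)

    def on_response(self, msg: Message, response: str, now: float) -> bool:
        """Handle the reply for msg, which was already popped for delivery."""
        if not any(rule.search(response) for rule in SUBQUEUE_RULES):
            return False
        # Step 3: move this message and its siblings into a sub-queue.
        sub = self.sub_queues.setdefault(msg.from_domain, SubQueue())
        sub.messages.append(msg)
        remaining = deque()
        for queued in self.general:
            if queued.from_domain == msg.from_domain:
                sub.messages.append(queued)
            else:
                remaining.append(queued)
        self.general = remaining
        # Step 4: back off delivery attempts for this sub-queue only.
        sub.paused_until = now + PAUSE_SECONDS
        return True

    def deliverable(self, now: float) -> list:
        """Messages eligible for a delivery attempt at time now."""
        ready = list(self.general)
        for sub in self.sub_queues.values():
            if now >= sub.paused_until:
                ready.extend(sub.messages)
        return ready
```

In this sketch the general queue keeps flowing while the throttled domain waits out its pause. A production version would also need to retire empty sub-queues and handle message expiry, which are omitted here.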

We’ve been rolling out this feature across our cloud product with amazing results. Here are the key metrics we’ve been monitoring that speak to the benefits:

Reduced Deferral Events 

When we detect a specific transient SMTP error and form a dynamic sub-queue for a given sending domain, we can then pause delivery temporarily for that queue, giving a mailbox provider like Gmail time to assess the reputation of that campaign.

By pausing delivery for just five minutes for the two SMTP responses mentioned earlier in the article, we are able to reduce the raw count of deferrals by over 99%. This means we are not wasting Google’s or our own resources by attempting to deliver messages that will only be deferred. The results visible in Google Postmaster Tools speak for themselves:

[Figure: Queue architecture rollout]

Improved Time to Delivery 

Beyond just reducing raw counts of deferrals, we also aimed to see performance improvements in the time to delivery for IP pools where many different sending domains shared the queue. We conducted an A/B test on two high-volume IP pools, applying the new Dynamic Sub-queue logic to half the IPs in the pool and leaving the other IPs without the new logic.

Over the course of our five-day test, we processed approximately 31 million messages to Gmail addresses from 3,400 unique sending domains. We achieved a 58% improvement in the 95th percentile for time to delivery for all senders in the pools.

[Figure: Gmail, all senders, 95th percentile time to delivery]

We also isolated the data for the higher-reputation senders that did not cause any deferrals during the test period. Their 95th percentile delivery time improved even more, decreasing by 70%, from 96 seconds without sub-queues to 28 seconds with sub-queues.

[Figure: High-reputation senders, 95th percentile time to delivery]
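For readers who want to reproduce this kind of comparison on their own delivery logs, here is a small sketch of the two calculations involved: a 95th percentile over per-message delivery times, and the percentage decrease between two such values. The function names are hypothetical. The only figures taken from the test are the 96-second and 28-second values; no raw test data is reproduced here.

```python
import statistics


def p95(delivery_seconds):
    """95th percentile of per-message delivery times, in seconds.

    statistics.quantiles with n=100 returns 99 cut points;
    index 94 is the 95th percentile.
    """
    return statistics.quantiles(delivery_seconds, n=100, method="inclusive")[94]


def percent_decrease(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


# Figures reported for high-reputation senders: 96 s without, 28 s with.
print(round(percent_decrease(96, 28), 1))  # 70.8
```

A high percentile is a useful lens here because head-of-line blocking mostly hurts the slowest messages, so a median could look healthy while the tail suffers.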

Finally, when we analyzed only the low-reputation senders causing deferrals, we found their time to delivery increased significantly. This makes sense because we were adding a minimum of 300 seconds to the delivery time of messages after a deferral event was encountered. This is actually an ideal scenario: you want to process lower-reputation traffic at a slower rate, both to reduce the likelihood of more aggressive blocks developing and to give filters a chance to adjust, assuming the traffic itself is actually desirable.

[Figure: Low-reputation senders, time to delivery]

With such great results from our testing, we are continuing to deploy these rules for more of our cloud customers. Dynamic Sub-queues are also available to on-premise Hurricane MTA customers in the most recent beta build. Contact our support team for more details about using beta versions of Hurricane MTA.

DelOps is Here to Stay 

This is the nerdy stuff that we love here at SocketLabs. Deliverability operations is not dead, and as we progress further into this new email era, it’s clear there is still a strong need to adapt and innovate in the email infrastructure arena.  

This shift to a greater focus on authenticated sending domain reputation doesn’t alleviate the need for ESPs to be proactive with shared IP pool management. Instead, it means ESPs need to have an infrastructure or MTA partner that shows how they are proactively crafting solutions that elevate the quality of their service and, by extension, the success of clients’ email messages. 

We’re focused on innovation at SocketLabs. Is your email infrastructure partner?
