Understanding Modern Spam Filters

by Brian Godiksen - July 7, 2016

[Image: Sophisticated Spam Filtering comic]

Spam filtering used to be a cut-and-dried affair: binary, black and white. Mail was either spam or it was not. Analyzing a message and deciding which bucket it belonged in was also a simple process. For the most part, filtering technology scaled well along with the rise of the internet. DNS blacklists, the most binary filtering concept, were efficient enough to remain effective. Over the years, though, the cost of compute power has dropped significantly and the perceived value of “the inbox” has steeply increased. This has led to a new age in spam filtering.
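For readers who have never queried one, here is a rough sketch of why a DNS blacklist lookup is so binary. By convention, the sending IP address is reversed, appended to the blacklist's zone, and looked up in DNS: an answer means "listed", no answer means "not listed". The zone name dnsbl.example.org and the function names are placeholders for illustration, not a real blacklist or any particular provider's implementation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers for this IP, False otherwise.

    Simplification: any resolution failure (including a network problem)
    is treated as "not listed" here.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# 192.0.2.99 is a documentation-only address; the zone is a placeholder.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

There is no scoring and no per-recipient nuance in a lookup like this, which is exactly what made the approach cheap enough to scale.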

The goal of many filtering systems today is not just to sweep away malicious email messages but rather to keep the inbox clean of any email message that the recipient doesn’t want. What makes this even harder than it sounds is that the messages that are wanted can vary greatly from one recipient to the next. This is where the value of cheap compute power comes into play.

The Rise of Engagement Based Filtering

A few years ago Google started dabbling in machine learning. They realized that with 16,000 CPU cores chugging away at unlabeled YouTube videos, they could build a system that identified cats. Throwing thousands of CPU cores at finding cats is a pretty clear demonstration of the reduced cost barrier of compute power.

While finding cats in unlabeled videos is neat, the real value was in the systems being built. The goal was to start using this technology in products to provide a better experience to end users. One of the first real-world implementations was announced last year. Google disclosed that it took this deep learning algorithm and began applying it to the spam problems faced by its one billion users. It makes an inbox composed of only wanted messages possible.

Google needed data about what is and isn’t wanted to start feeding the machine learning system. It immediately turned to the reaction of its users to each individual message as one of the primary data sources. This isn’t the first time that recipient engagement has played a role in message filtering, as spam complaints have long impacted deliverability. It is, however, without question the most robust implementation.

Using engagement data as a part of the spam filtering process is a natural fit for machine learning because it allows mailbox providers to give a more personalized experience to users and help fight off spammers, both positive outcomes. Machine learning is also very effective at making use of such a large data set; making decisions based on hundreds of billions of data points isn’t easy. Recipient engagement is still only one of many factors that ultimately determine inbox placement, but it is quickly becoming more heavily relied upon, especially by mailbox systems like Gmail.

The Proof is in the Data

Google is now claiming that its spam filters accurately catch 99.9% of all spam, and only mislabel good messages 0.05% of the time. This level of accuracy is only obtainable by the inclusion of user reaction data in the filtering process. Here at SocketLabs we can see why engagement based spam filtering can be so accurate. Looking at data from the messages our customers send and breaking it out into categories shows major distinctions in engagement. We are lucky to have insights into just about every type of email message that can be sent. We generally break down messages into one of three categories:

Person to Person
Transactional
Marketing

A general assumption which can be made is that person to person and transactional messages will be more “wanted”. The data reflects this very clearly. The most valuable data that any email sender can collect to determine engagement is the open rate and click rate. Here is a brief comparison of open and click rates between different types of email messages.

Traditional Marketing Campaign

A “batch and blast” style email campaign to about 150,000 recipients in the above example netted a roughly 20% open rate and 2% click-through rate. While slightly below average for its industry, this is the common reaction by recipients to this type of message. There is a strong likelihood that the reaction of recipients negatively impacted the inbox placement of messages for other recipients.

Triggered/Individual Marketing Message

Targeting messages and providing recipients with dynamic and relevant information significantly increases engagement. With nearly three times as many messages being opened by recipients, this more highly targeted stream of marketing messages is much less likely to get filtered to spam. The more positive reaction to receiving this kind of message keeps these messages flowing to the inbox.

Transactional Messages

Since the SocketLabs platform tracks every open and click event, not just the first occurrence, highly desired retail/e-commerce receipts and shipping notices in the above mail stream generated 170%+ open rates and nearly 80% click-through rates. Engagement rates at this level begin to overpower other possibly negative sender reputation metrics like spam complaints, failures, and message content. While checking where your package is three times a day doesn’t get it to you any faster, it does help providers like Gmail know that you never want shipping confirmation messages put in your spam folder.
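An open rate above 100% can look like a mistake, so here is a small sketch of the arithmetic. It assumes each tracked open is recorded as the ID of the message that was opened; the function name and the sample data are made up for illustration and are not the SocketLabs API or real customer data. Counting every open event gives a rate that can exceed 100%, while counting each message at most once cannot.

```python
def open_rates(delivered: int, open_events: list[str]) -> tuple[float, float]:
    """Return (total_open_rate, unique_open_rate) as fractions of delivered mail.

    open_events holds one message ID per tracked open, so a message that was
    opened three times appears three times.
    """
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    total_opens = len(open_events)         # every open event counts
    unique_opens = len(set(open_events))   # each message counts at most once
    return total_opens / delivered, unique_opens / delivered


# Hypothetical stream: 4 shipping notices delivered, 6 opens across 3 of them.
events = ["m1", "m1", "m1", "m2", "m2", "m3"]
total_rate, unique_rate = open_rates(delivered=4, open_events=events)
print(f"total open rate:  {total_rate:.0%}")   # 150%
print(f"unique open rate: {unique_rate:.0%}")  # 75%
```

In this made-up stream one recipient never opened the message at all, yet the total open rate is still 150% because the other three kept coming back to it.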

Person to Person and Inter-Office Mail

Person to person email messages are often the most critical message content transmitted. If the email from your boss requesting an update on this month’s sales numbers didn’t make it into your inbox, there would be many consequences. While messages from your boss may not be “wanted”, the fact that you opened and reread the message five times tells mailbox providers just how important that message really is to you. SocketLabs customers that have enabled open tracking on their personal messages have open rates between 100% and 500%. No other category of mail comes close to personal mail. While part of this engagement results from tracking pixels persisting in the message through replies and forwards, it is still impressive.

Engagement Based Filtering is Here to Stay

The long-term direction for spam filtering on the internet is directly tied to the mailbox services space. This industry continues to consolidate as providers like Google and Microsoft convince businesses to ditch private on-premise mail servers and move to the cloud, while consumers are lured by their wide array of free services. This means more data for the machine learning spam filtering systems, which makes them more accurate. For senders of email this is a good thing, especially if you send highly wanted email messages. You will likely have fewer spam filters to battle.

Hopefully our own data has made it clear why engagement based filtering can be so accurate. This type of filtering is here to stay; it isn’t just a fad.

For SocketLabs customers not already tracking recipient engagement, contact our support team for help getting this feature set up. If you are not already a SocketLabs customer, then sign up now and start collecting data on the engagement of your messages today!
