Unwanted-Email Filtering

This is the third in a series of webpages on Emmesmail's development of email filters, all of which have been based on Paul Graham's seminal 2002 work on Bayesian email filtering. The first, anti-spam.html, explained how we initially adapted and implemented Paul Graham's proposals. The second, anti-spam.devel.html, documented our filter's behavior, performance, and later development from 2015 to 2021.

This third webpage was started to acknowledge that the emails we are interested in diverting to a spam archive today are not the same kind we diverted 20 years ago. Back then, those emails were truly spam, or "unsolicited bulk mail" as it was alternatively called. Today, the emails that Emmesmail classifies as "spam" and diverts to the spam archive are not spam in the original sense. They are not sent out in bulk by an unknown spammer. They often are directed specifically to us by institutions with which we have a relationship: a bank, or an organization or association we are part of. We divert these emails because we consider them unnecessary and unwanted, and we are unable either to sever our connection to the sender or to persuade the sender not to send them. They are harder to filter because they often share much in common with required or wanted emails from the same sender. Nonetheless, we currently are attaining levels of filtering comparable to or better than those we attained previously, with results generally exceeding those reported elsewhere in the literature.

Period	Received	Director-diverted	Valid passed-all	Bayes-fp	Spam-Bayes	Spam-Directed	Spam-from-blacklist*	Spam-from-whitelist	Spam-missed	Bayes-Efficiency(%)	Overall-Efficiency(%)
2022	1243	7	853	56	314	0	0	7	24	93.0 ± 1.4
2023	1429	84	897	42	435	16	0	4	11	96.1 ± 0.9	99.2 ± 0.3
2024	847	11	396	10	405	3	12	2	7	96.2 ± 0.7	98.0 ± 0.5

*Filter active for the first half of 2024 only.

As described elsewhere, our unwanted-email filter comprises more than one filter. It currently uses whitelist filtering, classic Bayesian filtering, directorfile filtering, and blacklist filtering.

As far as I know, directorfile filtering is unique to our system. The directorfile's primary purpose is to decide, based upon sender, apparent recipient, and subject, to which email account delivery should be attempted. During delivery, the email is subjected to the rest of our filters (using account-specific parameters) and may in fact end up in an account's spam mailbox. Another important aspect of directorfile filtering is that the directorfile can decide, again based upon sender, apparent recipient, and subject, not to subject the email to subsequent filtering but to divert it directly to an account mailbox. This is particularly useful in handling vitally important 2-factor authentication emails, which, except for the subject, often have a format similar to that of unwanted emails from the same sender.

Our whitelist filter is similar to the whitelist filters used by other email-filtering programs. Mail from senders on the whitelist is sent to the designated recipient regardless of the assessment of the Bayesian filter.

The blacklist filter is very different from the one we used a decade ago. It is much more modest. It is not a list of every spammer known to us. It is a short list of unwanted-email senders (currently only six addresses) who have succeeded in getting an unwanted email past our defenses within the past year. Its primary purpose is to send emails to the spam folder while the Bayesian filter is in its learning phase. The blacklist filter is applied after our Bayes filter and serves as a check on it. Thus, any email designated as spam-blacklist is considered a success when calculating overall efficiency, but an error when calculating Bayes efficiency.

The spam-whitelist category, unwanted email from someone on our whitelist, is generally ignored, since it represents an error by the user in putting that sender on the whitelist rather than an error by any of the filters.
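The order in which the filters described above interact can be sketched roughly as follows. This is only an illustration, not Emmesmail's actual code: the names `Email`, `DirectorRule`, and `route`, the case-insensitive substring matching, the example addresses, and the placement of the whitelist check ahead of the Bayesian test are all assumptions made for the sketch. The Bayesian classifier itself is replaced by a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set, Tuple


@dataclass
class Email:
    sender: str
    recipient: str
    subject: str


@dataclass
class DirectorRule:
    """Hypothetical directorfile entry; an empty pattern matches anything."""
    sender: str = ""
    recipient: str = ""
    subject: str = ""
    account: str = "main"
    direct_mailbox: Optional[str] = None  # if set, skip all later filters

    def matches(self, mail: Email) -> bool:
        return (self.sender.lower() in mail.sender.lower()
                and self.recipient.lower() in mail.recipient.lower()
                and self.subject.lower() in mail.subject.lower())


def route(mail: Email,
          rules: Iterable[DirectorRule],
          whitelist: Set[str],
          blacklist: Set[str],
          bayes_is_spam: Callable[[Email], bool],
          default_account: str = "main") -> Tuple[str, str, str]:
    """Return (account, mailbox, reason) for one email."""
    account = default_account
    # 1. Directorfile: choose the account, and possibly bypass later filters.
    for rule in rules:
        if rule.matches(mail):
            account = rule.account
            if rule.direct_mailbox is not None:
                return account, rule.direct_mailbox, "director"
            break
    # 2. Whitelist: delivered regardless of the Bayesian assessment.
    if mail.sender in whitelist:
        return account, "inbox", "whitelist"
    # 3. Bayesian filter (stand-in callable here).
    if bayes_is_spam(mail):
        return account, "spam", "bayes"
    # 4. Blacklist: applied after, and as a check on, the Bayesian filter.
    if mail.sender in blacklist:
        return account, "spam", "blacklist"
    return account, "inbox", "passed-all"


# Illustrative data only.
rules = [DirectorRule(sender="bank.example", subject="verification code",
                      account="main", direct_mailbox="inbox")]
whitelist = {"friend@example.org"}
blacklist = {"promo@shop.example"}
fake_bayes = lambda m: "offer" in m.subject.lower()  # stand-in classifier

two_factor = route(Email("alerts@bank.example", "me@example.org",
                         "Your verification code"),
                   rules, whitelist, blacklist, fake_bayes)
bank_promo = route(Email("alerts@bank.example", "me@example.org",
                         "Special offer for you"),
                   rules, whitelist, blacklist, fake_bayes)
```

In this sketch the 2-factor authentication email is delivered by the directorfile rule without ever reaching the Bayesian test, while a promotional email from the same sender falls through to the remaining filters.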

For historical reasons, the calculation of efficiency for Bayesian email filters has differed from that in other fields in that it considered only false negatives (spam not caught by the filter) as errors. To be consistent with the more widely accepted definition, the efficiency calculation also should take into account false positives (non-spam email designated by the spam filter as "spam"), and this is what we have done.

In our case, tt, the total number of emails tested, equals emails-received minus emails-diverted-from-filtering-by-the-directorfile, and errs is the number of errors.

For the Bayes-efficiency calculation, errs equals the sum of all emails characterized as spam-missed, spam-directorfile, spam-blacklist, or ok-false-positive.

For the Overall-efficiency calculation, the only errors are those designated spam-missed or ok-false-positive.

Ignoring the tiny number of emails with classification spam-whitelist has no practical effect upon the calculations.
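As a worked illustration of the two error counts, the sketch below applies them to the 2024 row of the results table. Several things here are assumptions rather than statements from the text: that efficiency is 100 × (tt − errs) / tt, that the quoted uncertainty is the binomial standard error of the error fraction, and that the table columns Bayes-fp and Spam-Directed correspond to the ok-false-positive and spam-directorfile categories. The function and variable names are likewise illustrative.

```python
import math


def efficiency(tested: int, errors: int) -> tuple[float, float]:
    """Return (efficiency %, uncertainty %) for `errors` mistakes in `tested` emails.

    Assumed formulas: efficiency = 100 * (1 - errors/tested), with the
    uncertainty taken as the binomial standard error of the error fraction.
    """
    if tested <= 0:
        raise ValueError("tested must be positive")
    p_err = errors / tested
    eff = 100.0 * (1.0 - p_err)
    sigma = 100.0 * math.sqrt(p_err * (1.0 - p_err) / tested)
    return eff, sigma


# Counts copied from the 2024 row of the table.
row = {
    "received": 847,
    "director_diverted": 11,
    "bayes_fp": 10,         # assumed to be the ok-false-positive category
    "spam_directed": 3,     # assumed to be the spam-directorfile category
    "spam_blacklist": 12,
    "spam_missed": 7,
}

# tt: emails received minus those the directorfile diverted from filtering.
tt = row["received"] - row["director_diverted"]

# Bayes efficiency counts every email the Bayes filter got wrong.
bayes_errs = (row["spam_missed"] + row["spam_directed"]
              + row["spam_blacklist"] + row["bayes_fp"])

# Overall efficiency counts only what the combined filters got wrong.
overall_errs = row["spam_missed"] + row["bayes_fp"]

bayes_eff, bayes_sigma = efficiency(tt, bayes_errs)
overall_eff, overall_sigma = efficiency(tt, overall_errs)

print(f"Bayes efficiency:   {bayes_eff:.1f} ± {bayes_sigma:.1f}")
print(f"Overall efficiency: {overall_eff:.1f} ± {overall_sigma:.1f}")
```

Under these assumptions the 2024 counts give 96.2 ± 0.7 for Bayes efficiency and 98.0 ± 0.5 for overall efficiency, in agreement with that row of the table.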
