This is the third in a series of webpages dealing with Emmesmail's development of email filters, all of which have been based on Paul Graham's seminal 2002 work on Bayesian email filtering. The first, anti-spam.html, explained how we initially adapted and implemented Paul Graham's proposals. The second, anti-spam.devel.html, documented our filter's behavior, performance and later development from 2015 to 2021.
This third webpage was started to acknowledge that the kind of email we are interested in diverting to a spam archive today is not the same kind we diverted 20 years ago. Twenty years ago, those emails were truly spam, or "unsolicited bulk mail" as it was alternatively called. Today, the emails we classify as "spam" and divert to the spam archive using Emmesmail are not spam in the original sense. They are not sent out in bulk by an unknown spammer. They are often directed specifically to us by institutions we have a relationship with: a bank, or an organization or association we are part of. We divert these emails because we consider them unnecessary and unwanted, and are unable either to sever our connection to the sender or to persuade the sender not to send them. They are harder to filter because they often share much in common with required or wanted emails from the same sender. Nonetheless, we are currently attaining levels of filtering comparable to or better than those we attained with our anti-spam filters.
|93.0 ± 1.4
|96.1 ± 0.9
|99.2 ± 0.3
|94.7 ± 1.4
|97.6 ± 1.0
As described elsewhere, our unwanted-email filter consists of more than one filter. It currently has whitelist-filtering, classic Bayesian-filtering, directorfile-filtering, and blacklist-filtering.
As far as I know, directorfile filtering is unique to our system. The directorfile's primary purpose is to decide, based upon sender, apparent recipient, and subject, to which email account delivery should be attempted. During delivery, the email is subjected to the rest of our filters (using account-specific parameters) and may in fact end up being delivered to an account's spam mailbox. Another important aspect of directorfile filtering is that the directorfile can decide not to subject the email to subsequent filtering but, again based upon sender, apparent recipient, and subject, to divert the email directly to an account mailbox. This is particularly useful in handling vitally important 2-factor authentication emails, which, except for the subject, often have a format similar to that of unwanted emails sent by the same sender.
Our whitelist-filter is similar to whitelist-filters used by other email filtering programs. Mail from senders on the whitelist is sent to the designated recipient regardless of the assessment of the Bayesian filter.
The blacklist filter is very different from the one we used a decade ago. It is much more modest. It is not a list of every spammer known to us. It is a short list of unwanted-email senders (currently only six addresses) who have succeeded in getting an unwanted email past our defenses within the past year. Its primary purpose is to send emails to the spam folder while the Bayesian filter is in its learning phase. The blacklist filter is applied after our Bayes filter and serves as a check on it. Thus, any email designated as spam-blacklist is considered a success when calculating overall efficiency, but an error when calculating Bayes efficiency.
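The routing decision described above can be illustrated with a minimal Python sketch. It is not Emmesmail's actual directorfile format or code: the `Rule` fields, the glob-style patterns, the first-match-wins ordering, the account names and the `route` function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Rule:
    """One hypothetical directorfile entry (glob patterns, case-insensitive)."""
    sender: str
    recipient: str
    subject: str
    account: str                  # account to which delivery is attempted
    bypass_filters: bool = False  # True: deliver directly, skip later filters


@dataclass
class Routing:
    account: str
    bypass_filters: bool


def _matches(pattern: str, value: str) -> bool:
    # Lower-case both sides so matching behaves the same on every platform.
    return fnmatchcase(value.lower(), pattern.lower())


def route(rules, sender, recipient, subject, default_account="main"):
    """Return the routing for the first rule matching all three fields."""
    for rule in rules:
        if (_matches(rule.sender, sender)
                and _matches(rule.recipient, recipient)
                and _matches(rule.subject, subject)):
            return Routing(rule.account, rule.bypass_filters)
    # No rule matched: assumed fallback is normal filtering on a default account.
    return Routing(default_account, False)


# Hypothetical rules: 2-factor codes from a bank skip further filtering,
# while everything else from the same sender is filtered as usual.
RULES = [
    Rule("*@bank.example", "*", "*verification code*", "personal", True),
    Rule("*@bank.example", "*", "*", "personal", False),
]
```

With these rules, a message from `alerts@bank.example` whose subject contains "verification code" would go straight to the `personal` account mailbox, while a promotional message from the same address would still pass through the remaining filters.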
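The ordering of the Bayes and blacklist filters can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify`, the label `spam-bayes` and the example addresses are hypothetical (only `spam-blacklist` is a label used on this page), and the whitelist and directorfile steps are left out.

```python
def classify(sender: str, bayes_says_spam: bool, blacklist: set) -> str:
    """Apply the Bayes verdict first, then the blacklist as a backstop."""
    if bayes_says_spam:
        return "spam-bayes"       # hypothetical label: caught by the Bayes filter
    if sender.lower() in blacklist:
        # The Bayes filter let this one through; the blacklist catches it.
        # An overall success, but a miss for the Bayes filter.
        return "spam-blacklist"
    return "ok"


# Hypothetical blacklist entry, stored in lower case.
BLACKLIST = {"promo@shop.example"}
```

Because the blacklist is consulted only when the Bayes filter has passed the message, every `spam-blacklist` result marks a message the Bayes filter missed, which is why it counts against Bayes efficiency but not against overall efficiency.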
The spam-whitelist category, unwanted email from someone on our whitelist, is generally ignored, since it represents an error by the user putting that sender on the whitelist, rather than an error of any of the filters.
IMHO there are few, if any, email filters producing results as good as ours.
For historical reasons, the calculation of efficiency for Bayesian email filters has differed from the calculation of efficiency in other fields, in that it considered only false negatives (spam not caught by the filter) as errors. To be consistent with the more widely accepted definition, the efficiency calculation should also take into account false positives (non-spam email designated by the spam filter as "spam"), and this is what we have done.
The efficiency of any filter, expressed as a percentage, is defined as
efficiency = 100 × (tt - errs) / tt
In our case, tt (total emails tested) = emails-received - emails-diverted-from-filtering-by-the-director-file, and errs is the number of errors.
For the Bayes-efficiency calculation, errs equals the sum of all emails characterized as spam-missed, spam-directorfile, spam-blacklist, or ok-false-positive.
For the Overall-efficiency calculation, the only errors are those designated spam-missed or ok-false-positive.
Ignoring the tiny number of emails with classification spam-whitelist has no practical effect upon the calculations.
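The two calculations can be illustrated with a short Python sketch. It assumes that efficiency is the percentage of tested emails handled without error, 100 × (tt - errs) / tt; the function name, the dictionary layout and the counts in the example are made up for illustration and are not our measured data.

```python
# Classifications counted as errors for each calculation, as listed above.
BAYES_ERRORS = {"spam-missed", "spam-directorfile", "spam-blacklist", "ok-false-positive"}
OVERALL_ERRORS = {"spam-missed", "ok-false-positive"}


def efficiency(counts, error_labels, received, diverted):
    """Percent of tested emails that were not errors (assumed formula)."""
    tt = received - diverted  # emails actually subjected to filtering
    if tt <= 0:
        raise ValueError("no emails were tested")
    errs = sum(counts.get(label, 0) for label in error_labels)
    return 100.0 * (tt - errs) / tt


# Made-up example: 210 emails received, 10 diverted by the directorfile.
example_counts = {
    "spam-missed": 2,
    "spam-directorfile": 1,
    "spam-blacklist": 3,
    "ok-false-positive": 1,
}
bayes_eff = efficiency(example_counts, BAYES_ERRORS, received=210, diverted=10)
overall_eff = efficiency(example_counts, OVERALL_ERRORS, received=210, diverted=10)
```

In this made-up example the Bayes filter is charged with 7 errors out of 200 tested emails and the system as a whole with 3, so the Bayes efficiency comes out lower than the overall efficiency, as expected when the directorfile and blacklist catch what the Bayes filter missed.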