My email filter has four levels of spamminess: good, borderline, spam, and null (for the spammiest spam-spam-spam). At least a couple of times a week, something I want ends up in spam, so I check that folder every so often. This is a pain because there’s so much garbage. So I hit on a scheme to send more of what was ending up in spam straight to null: if more than half of the characters in the From or Subject line are non-ASCII, the message goes to null. My email filter: now 85% more provincial!
So I noodled around with Perl on the command line until I got this:
perl -MEmail::Folder -e 'use Encode qw(decode); $f=Email::Folder->new(shift @ARGV); for my $m ($f->messages) { $_=decode("Mime-header",$m->header("subject")); $a=()=/\p{ascii}/g; $l=length; $r = $l ? $a/$l : 0; print sprintf "% 3d % 3d %.2f %s\n", $a, $l, $r, $_ }' spam
which became this in the filter script:
sub ascii_ratio {
    my $str = shift;
    return 0 unless length $str;
    my $num_ascii =()= $str =~ /\p{ASCII}/g;
    return $num_ascii / length $str;
}

my $from    = decode("Mime-header", $email->header('from'));
my $subject = decode("Mime-header", $email->header('subject'));

if (ascii_ratio($from) < .5 or ascii_ratio($subject) < .5) {
    $email->accept("/home/zed/Mail/IN.null");
    exit;
}
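To get a feel for what the ratio does, here is a small standalone sketch that runs the same ascii_ratio logic on a few made-up strings. The sample strings are my own illustration, not real mail, and the script leaves out the Email::Folder and decode steps entirely.

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

# Same logic as the filter's ascii_ratio: the fraction of characters
# in the string that are ASCII. An empty string yields 0.
sub ascii_ratio {
    my $str = shift;
    return 0 unless length $str;
    # =()= puts the match in list context and then counts the matches.
    my $num_ascii = () = $str =~ /\p{ASCII}/g;
    return $num_ascii / length $str;
}

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical subject lines, chosen only to show the range of ratios.
for my $s ('Meeting at 3pm', 'Привет, world', 'こんにちは', '') {
    printf "%.2f  '%s'\n", ascii_ratio($s), $s;
}
```

This prints 1.00, 0.54, 0.00 and 0.00. Note the last one: an empty string scores 0, which is below the .5 cutoff, so with the filter as written a message with an empty subject would presumably land in null too.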
Notice that ‘ascii’ in the one-liner became ‘ASCII’ in the script. That’s because I developed the one-liner in perl 5.16 and am running the script on perl 5.10, and in 5.10, Unicode Character Property names are case-sensitive but in 5.12+, they’re not. Yay.
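One way to sidestep the property-name spelling question altogether is to match on an explicit code point range instead of a Unicode property. This is only a sketch of an alternative, not what the filter script does, and count_ascii is a name I made up for it.

```perl
use strict;
use warnings;

# Count ASCII characters using an explicit code point range (0x00-0x7F)
# rather than \p{...}, so the spelling of a property name never matters.
# count_ascii is a hypothetical helper, not part of the filter script.
sub count_ascii {
    my $str = shift;
    my $n = () = $str =~ /[\x00-\x7F]/g;
    return $n;
}

my $s = "caf\x{e9} \x{263a}";    # "café ☺": 4 ASCII characters out of 6
print count_ascii($s), " of ", length($s), "\n";
```

That prints "4 of 6". The trade-off is readability: \p{ASCII} says what it means, while the range makes you remember where ASCII ends.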