Zed Lopez

How Many Non-ASCII Characters in the Subject Line?

My email filter has four levels of spamminess: good, borderline, spam, and null (for the spammiest spam-spam-spam). At least a couple of times a week, something I want ends up in spam, so I check that folder every so often. This is a pain because there’s so much garbage. So I hit on a scheme to send more of what was ending up in spam straight to null: if more than half of the characters in the From or Subject line are non-ASCII, the message goes to null. My email filter: now 85% more provincial!

So I noodled around with Perl on the command line until I got this:

perl -MEmail::Folder -e 'use Encode qw(decode); $f=Email::Folder->new(shift @ARGV); for my $m ($f->messages) { $_=decode("Mime-header",$m->header("subject")); $a=()=/\p{ascii}/g; $l=length; $r = $l ? $a/$l : 0; printf "% 3d % 3d %.2f %s\n", $a, $l, $r, $_ }' spam

which became this in the filter script:

sub ascii_ratio {
  my $str = shift;
  return 0 unless length $str;
  my $num_ascii =()= $str =~ /\p{ASCII}/g;
  return $num_ascii / length $str;
}

my $from = decode("Mime-header", $email->header('from'));
my $subject = decode("Mime-header", $email->header('subject'));

if (ascii_ratio($from) < .5 or ascii_ratio($subject) < .5) {
  $email->accept("/home/zed/Mail/IN.null");
  exit;
}
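
To get a feel for where the .5 cutoff lands, here is a small standalone sketch that runs the same ascii_ratio logic over a few made-up header values. The sample strings and the keep/null labels are my own illustration, not anything from the real filter, and it uses Encode's MIME-Header decoding the same way the script does.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same logic as the filter's ascii_ratio: fraction of characters that are ASCII.
sub ascii_ratio {
  my $str = shift;
  return 0 unless length $str;
  my $num_ascii = () = $str =~ /\p{ASCII}/g;
  return $num_ascii / length $str;
}

# Made-up header values, for illustration only.
my @samples = (
  'Meeting notes',                    # plain ASCII
  '=?UTF-8?Q?=E4=BD=A0=E5=A5=BD?=',   # RFC 2047 Q-encoded, two CJK characters
  '',                                 # empty header
);

for my $raw (@samples) {
  my $decoded = decode('MIME-Header', $raw);
  my $ratio   = ascii_ratio($decoded);
  # Print only the ratio and the verdict, to avoid wide-character warnings.
  printf "%.2f %s\n", $ratio, $ratio < .5 ? 'null' : 'keep';
}
```

One thing this makes visible: an empty string gets a ratio of 0, so under the filter's test a message with an empty From or Subject would also go to null.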

Notice that ‘ascii’ in the one-liner became ‘ASCII’ in the script. That’s because I developed the one-liner on Perl 5.16 but run the script on Perl 5.10. In 5.10, Unicode character property names are case-sensitive; in 5.12+, they’re not. Yay.
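
One way to sidestep the property-name question entirely is to count with an explicit code point range instead of \p{...}. This is just a sketch of an alternative, not what the filter actually runs; ascii_ratio_tr is a name made up for the example. It relies on tr/// with an empty replacement list, which returns the number of matching characters and leaves the string alone.

```perl
use strict;
use warnings;

# Hypothetical alternative to ascii_ratio: count characters in the
# range 0x00-0x7F with tr///, so no Unicode property name is involved.
sub ascii_ratio_tr {
  my $str = shift;
  return 0 unless length $str;
  my $num_ascii = ($str =~ tr/\x00-\x7F//);
  return $num_ascii / length $str;
}

# 3 of the 4 characters in "caf\x{E9}" are ASCII.
printf "%.2f\n", ascii_ratio_tr("caf\x{E9}");
```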