Eaton – A Bouncer for WikiSpam and Blog Spam

Eugene Kim and I got together yesterday after the WikiSpam Workshop here at Wiki Symposium 2005 and hacked together a proof-of-concept universal blacklist wrapper called “Eaton”, which can be used with almost any CGI-based blog or wiki engine.

Eaton is the first client of the community-based wikispam blacklist maintained by the SharedAntiSpam effort that got started at Wikimania in August, and moved ahead a little more here at WikiSym.

The basic idea of SharedAntiSpam:

  • create a shared, URL-only wikispam blacklist, similar in concept to the blacklist Jay Allen maintained for MT-Blacklist

The basic idea of Eaton:

  • a simple, standalone script that runs in front of pretty much any CGI-based blog or wiki engine
  • the script examines all the text that gets posted from the browser to the blog or wiki engine
  • the input text is compared to one or several blacklists (the SharedAntiSpam blacklist, by default, but it can be configured to use others)
  • if there are matches, the script exits with an error code, causing an HTTP error to be returned to the user or robot trying to post the spam — keeping the spam from hitting the blog or wiki
  • if there are no matches, the script runs the blog or wiki engine and passes it the input from the browser, and the blog or wiki engine runs normally.

We wanted to get this out in the world and get other people working on it quickly, so the Eaton code is the simplest thing that could work, not anything fancy. I’m going to start using it here on my personal blog and on KaminskiWiki, and I expect that Eugene, I, and others will keep working on it and improving it.

Here are some of the improvements that are needed:

  • documentation, including installation docs (basic idea: rename wiki.pl to something like real-wiki.pl; edit eaton.pl to call real-wiki.pl; rename eaton.pl to wiki.pl)
  • support for comments (or other fields TBD) in the blacklist (follow the SharedAntiSpam group’s design)
  • multiple blacklist support, with easy configurability
  • support for other types of banning, such as IP address blocks
  • support for a local blacklist
  • caching of remote blacklists, and merging of blacklist change feeds
  • support for GET method (do we need this?)
  • support other spam defense methods (Bayesian filters, throttling multiple posts with links to the same domain, etc.)
  • fancier coding (but keep the script small and the install easy and generic)
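
Several of these items (comments in the blacklist, multiple blacklists, a local blacklist) come down to how the blacklist text is parsed and combined. Here is a minimal sketch of one way that could work; it is not part of Eaton. It assumes a hypothetical file format in which each line is one regular expression and whitespace followed by `#` starts a comment. The SharedAntiSpam group's actual design may well differ. The names parse_blacklist, merge_blacklists, and first_match are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: one regular expression per line, with an optional trailing
# "# comment" (the "#" must follow whitespace, so URL fragments survive).
sub parse_blacklist {
    my ($raw) = @_;
    my @patterns;
    foreach my $line (split /\r?\n/, $raw) {
        $line =~ s/\s+#.*$//;                    # drop a trailing comment
        $line =~ s/^\s+|\s+$//g;                 # trim surrounding whitespace
        next if $line eq '' or $line =~ /^#/;    # skip blank and comment-only lines
        my $re = eval { qr/$line/i };            # undef if the pattern is invalid
        push @patterns, $re if defined $re;
    }
    return @patterns;
}

# Merge several blacklist texts (for example shared plus local) into one
# pattern list, dropping duplicates while keeping first-seen order.
sub merge_blacklists {
    my @raw_lists = @_;
    my (%seen, @merged);
    foreach my $raw (@raw_lists) {
        foreach my $re (parse_blacklist($raw)) {
            push @merged, $re unless $seen{"$re"}++;
        }
    }
    return @merged;
}

# Return the first pattern that matches the text, or nothing if none match.
sub first_match {
    my ($text, @patterns) = @_;
    foreach my $re (@patterns) {
        return $re if $text =~ $re;
    }
    return;
}
```

Skipping blank lines matters more than it looks: an empty pattern matches every post, so a single stray blank line in a blacklist would block all edits. Invalid patterns are silently dropped here; a real version would probably want to log them.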

Similarly, the SharedAntiSpam effort is just getting started; if you’ve got ideas or would like to help, send email to Eugene.

We’re going to move all this onto SourceForge, but in the meantime, you can download the latest version of Eaton from http://www.eekim.com/software/eaton/eaton.pl.

Here is the original version, Eaton-20051016:

#!/usr/bin/perl
#
# Conceived and implemented by Peter Kaminski
# and Eugene Eric Kim on October 16, 2005 at
# WikiSym.  Thanks to the good folks who attended the WikiSpam
# workshop.
#
# This code is public domain.  Do your worst.

use strict;
use LWP::UserAgent;
use URI::Escape;

my $BLACKLIST_URL = 'http://purplewiki.blueoxen.net/spam-blacklist.txt';
my $REAL_SCRIPT = 'wiki.pl';

undef $/;
my $postContent = <STDIN>;

my $ua = new LWP::UserAgent(agent => 'antispam wrapper v0.1');
my $request = new HTTP::Request('GET', $BLACKLIST_URL);
my $result = $ua->request($request);

my $blacklist;
if ($result->is_success) {
    $blacklist = $result->content;
} else {
    die "unable to retrieve content: " . $result->code();
}
my @regexps = &parseBlacklist($blacklist);
my $joinedRe = join('|', @regexps);

my $decodedPostContent = uri_unescape($postContent);
if ($decodedPostContent =~ /$joinedRe/) {
    foreach my $re (@regexps) {
        if ($decodedPostContent =~ /$re/) {
            die "spammer posting $re!  dying intentionally...";
        }
    }
}
else {
    open FH, "|$REAL_SCRIPT" or die "$!";
    print FH $postContent;
    close FH;
}

### fini

sub parseBlacklist {
    my $rawBlacklist = shift;

    return split(/\n/, $rawBlacklist);
}
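
The script above downloads the blacklist on every POST, which is why caching of remote blacklists is on the improvements list. The following is a rough sketch of a file-based cache, not something Eaton does today. The function names (cached_blacklist, read_file, write_file) and the idea of passing the downloader in as a code reference are assumptions made so the example stays self-contained.

```perl
use strict;
use warnings;

# Return the blacklist text, using a local cache file when it is fresh enough.
# $fetch is a code reference that returns the remote text, or undef on failure.
sub cached_blacklist {
    my ($cache_file, $max_age_secs, $fetch) = @_;

    # Use the cache if it exists and was modified recently enough.
    if (-e $cache_file) {
        my $age = time() - (stat($cache_file))[9];
        return read_file($cache_file) if $age < $max_age_secs;
    }

    my $text = $fetch->();
    if (defined $text) {
        write_file($cache_file, $text);
        return $text;
    }

    # Fetch failed: fall back to a stale cache if one exists.
    return read_file($cache_file) if -e $cache_file;
    die "no blacklist available: fetch failed and no cache at $cache_file\n";
}

sub read_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot read $path: $!";
    local $/;                          # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return $text;
}

sub write_file {
    my ($path, $text) = @_;
    my $tmp = "$path.tmp.$$";          # write a temp file, then rename it into place
    open my $fh, '>', $tmp or die "cannot write $tmp: $!";
    print {$fh} $text;
    close $fh or die "cannot close $tmp: $!";
    rename $tmp, $path or die "cannot rename $tmp to $path: $!";
    return;
}
```

In Eaton, the code reference would wrap the LWP::UserAgent request shown above and return `$result->content` on success or undef on failure. Falling back to a stale cache when the download fails is a design choice: it keeps the wiki usable when the blacklist server is down, at the cost of missing recent entries.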

Comments (3)

  1. Jay Allen wrote:

    Pete, first of all, bravo to you for your idea of making the wrapper script. It’s a very good idea.

    However, you should think very carefully about your method of lookup. At this point in October, I’ve received over 11 MILLION hits from people/bots/spammers trying to download the master blacklist file (57KB). This sort of thing doesn’t scale well.

    I would highly suggest instead that you have the script extract all URLs and then test them via DNS lookups to either your own blacklist service or one of the others (bsb.spamlookup.net, opm.blitzed.org). DNS lookups are not only faster, but also incur very little bandwidth when compared with having a client pull down an ever growing blacklist.

    Trust me on this, it’s not a fun road to hoe.

    Wednesday, October 19, 2005 at 15:27
  2. Thanks, Jay. In the Wikimania discussions, there was a suggestion that we could forestall at least some of the traffic problems by hosting the blacklist on SourceForge to begin with.

    Fortuitously, before WikiSym I had read about your MT-Blacklist experiences, and so I talked about them in the SharedAntiSpam discussion there. The current lookup method works for the quick proof-of-concept, but yeah, it won’t work at any scale.

    The DNS lookup idea is a good one.

    Thanks for your experienced advice. :-)

    Wednesday, October 19, 2005 at 16:05
  3. Jay Allen wrote:

    You bet. Whatever I can do to help the anti-spam movement without becoming a poor man or having someone else become one. :-)

    Wednesday, October 19, 2005 at 19:42
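
The DNS-based lookup suggested in the first comment could look roughly like the sketch below: pull the host names out of the posted text, then ask a blacklist zone about each one. This is only an illustration under stated assumptions. The zone `bl.example` is a placeholder, the query form `host.zone` and the rule "any address record means listed" are assumptions that would need checking against whichever service is used, and the function names are made up. It relies only on Perl's built-in gethostbyname.

```perl
use strict;
use warnings;

# Extract the distinct host names from http/https URLs in a block of text.
sub extract_hosts {
    my ($text) = @_;
    my (%seen, @hosts);
    while ($text =~ m{https?://([A-Za-z0-9.-]+)}gi) {
        my $host = lc $1;
        $host =~ s/\.+$//;                       # drop trailing dots
        push @hosts, $host unless $host eq '' or $seen{$host}++;
    }
    return @hosts;
}

# Build the DNS name to query for a host against a blacklist zone.
# ASSUMPTION: the zone is queried as "<host>.<zone>".
sub dnsbl_query_name {
    my ($host, $zone) = @_;
    return "$host.$zone";
}

# True if the zone answers with an address record for the host.
# ASSUMPTION: any successful lookup means the host is listed.
sub is_listed {
    my ($host, $zone) = @_;
    my @answer = gethostbyname(dnsbl_query_name($host, $zone));
    return @answer ? 1 : 0;
}

# Return the hosts in the posted text that the zone lists.
sub listed_hosts {
    my ($text, $zone) = @_;
    return grep { is_listed($_, $zone) } extract_hosts($text);
}
```

A wrapper would call `listed_hosts($decodedPostContent, $zone)` and reject the post if the result is non-empty. Each post then costs a few small DNS queries, which resolvers cache, instead of a download of the whole blacklist.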