Resurrecting Dan Gillmor's Blog
Quick link to Dan's resurrected blog: http://bayosphere.com/
It started when I read this tweet from my friend Dan Gillmor:
dangillmor: o #hivemind: need help pulling my old blog postings (ideally w/comments) from archive.org; into in a Wordpress blog. can this be scripted?
You see, Dan Gillmor started blogging back in 1999, when he worked for the San Jose Mercury News. The Wikipedia article on Dan says that his blog is believed to have been the first blog ever written by a journalist for a traditional media company -- a combination of Dan's vision and the fact that he was located in Silicon Valley, where the web's future was happening earlier than in most other places.
Dan blogged a lot in the following years, but unfortunately, the blog itself was ultimately subject to webrot. As the original publisher's fortunes shifted with the changes in the traditional media market, the company changed hands a couple of times. Ultimately, the servers for Dan's blog were shut down, and the data forgotten.
Except at the wonderful Internet Archive -- the project that Brewster Kahle started to save snapshots of the living web, to keep the posted information safe from webrot!
Digital archivist Rudolf Ammann had poked around, and found that the Archive had saved many of Dan's original posts.
The Internet Archive is a wonderful resource, but it's best suited for archival purposes. Hence Dan's request for help pulling the old posts out of the Archive and back onto a living WordPress instance.
I jumped at the chance to help -- it's just the sort of thing I have fun doing, sifting through information and dusting it off to make it easy to use again. For this sort of thing, I use various web tools and a lot of bespoke Perl -- one-off scripts that can sift through the many archived web pages to pull out just the text of the post, then reformat it into something that can be imported into WordPress.
Here's Dan's 10th anniversary post, wherein he launches the resurrected blog at Bayosphere.com: Welcome to My Old Blog.
And here's Dan's first blog post (newly remastered and restored! :-) -- We Launch.
Some of the technical details might be interesting to cover. Dan started blogging on a beta version of Dave Winer's Manila platform. The Mercury tech team later switched to Movable Type. I wrote custom parsing code for each platform's posts; luckily the format was stable for each one, so the parsing wasn't too complicated.
The parsing included finding the content of each post and the date and time posted, and then copying that data over to a WordPress eXtended RSS (WXR)-formatted XML file. I also modified some of the image tags; the images hosted on the old Mercury servers weren't there any more, so I changed the existing "src" attribute's name to "orig-src", then added a new "src" attribute pointing at a placeholder image.
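The image-tag change described above can be sketched roughly like this. This is a minimal illustration in Python, not the original Perl; the function name, the placeholder path, and the regex-based approach are all assumptions made for the example.

```python
import re

# Hypothetical placeholder path; the real placeholder image URL is not given in the post.
PLACEHOLDER_URL = "/images/placeholder.png"

# Matches a whole <img ...> tag.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

# Matches a src attribute (quoted or unquoted), but not the "src" inside "orig-src".
SRC_ATTR = re.compile(
    r"""(?<![\w-])src(\s*=\s*)(".*?"|'.*?'|[^\s>]+)""",
    re.IGNORECASE,
)


def preserve_and_replace_src(html, placeholder=PLACEHOLDER_URL):
    """Rename each img tag's src to orig-src and add a src pointing at a placeholder."""

    def fix_tag(match):
        tag = match.group(0)
        if "orig-src" in tag.lower():
            # Already rewritten; leave it alone so the function can be re-run safely.
            return tag
        return SRC_ATTR.sub(
            lambda m: f'orig-src{m.group(1)}{m.group(2)} src="{placeholder}"',
            tag,
            count=1,
        )

    return IMG_TAG.sub(fix_tag, html)
```

Keeping the old URL in "orig-src" means the original image location is still recorded in the markup, in case the image ever turns up elsewhere.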
A nice outcome of the way the Internet Archive saved the posts was that comments were preserved inline with the post text, so we still have many of the original comments. There was a little comment spam, too, though luckily not much; I deleted it by hand. (Hopefully I caught all of it, but let me know if you notice any more.)
There were old email addresses in the posts and the comments; I removed the email addresses and turned the links into spans, with a popup that says "email address removed".
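Here is a rough sketch of that kind of email scrubbing, again in Python rather than the original Perl. It assumes the popup is a plain "title" attribute on the span and that bare addresses in the text are swapped for a bracketed note; both details, and the function name, are guesses for illustration.

```python
import re

NOTICE = "email address removed"

# Matches <a href="mailto:...">link text</a> and captures the link text.
MAILTO_LINK = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*["\']mailto:[^"\']*["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

# A deliberately simple pattern for bare addresses; it is not a full RFC 5322 parser.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(html):
    """Turn mailto links into spans with a hover note, then blank out bare addresses."""
    html = MAILTO_LINK.sub(
        lambda m: f'<span title="{NOTICE}">{m.group(1)}</span>', html
    )
    # Assumption: remaining addresses are replaced by a visible note rather than dropped silently.
    return EMAIL.sub(f"[{NOTICE}]", html)
```

Replacing the links first drops the address in the href along with the tag; the second pass then catches any address that appeared as visible text.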
In all, we rescued about 1,250 posts. There are about 40 more that need some additional work (the Archive was pulling too fast from the Manila servers, so it got error messages instead of actual data; I'm going to look through more Archive snapshots to see if it got the data at some point). I wrote a calendar script to enumerate all the days and check whether there were posts on each day, and Dan and I found some fairly large stretches with no posts where we think there should be some.
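The calendar check could look something like the following. This is a sketch in Python under assumed names (the original script was presumably one of the Perl one-offs), and the seven-day threshold for a "large" gap is an arbitrary choice for the example.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def days_without_posts(post_dates, start, end):
    """List every day from start to end (inclusive) that has no post."""
    posted = set(post_dates)
    missing = []
    day = start
    while day <= end:
        if day not in posted:
            missing.append(day)
        day += ONE_DAY
    return missing


def long_gaps(missing_days, min_length=7):
    """Collapse sorted missing days into (first, last) runs of at least min_length days."""
    gaps = []
    run_start = prev = None
    for day in missing_days:
        if prev is not None and day - prev == ONE_DAY:
            prev = day  # still inside the current run
            continue
        if run_start is not None and (prev - run_start).days + 1 >= min_length:
            gaps.append((run_start, prev))
        run_start = prev = day  # start a new run
    if run_start is not None and (prev - run_start).days + 1 >= min_length:
        gaps.append((run_start, prev))
    return gaps
```

With made-up post dates, usage might look like this:

```python
posts = [date(1999, 10, 25), date(1999, 10, 26), date(1999, 11, 5)]
missing = days_without_posts(posts, date(1999, 10, 25), date(1999, 11, 6))
print(long_gaps(missing))  # one nine-day gap, from 1999-10-27 to 1999-11-04
```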
So the recovery is still a work in progress; perhaps we can recover more. But check it out -- it's great to see it back on the web!
-- Peter Kaminski 16:06, 26 October 2009 (PDT)