Saturday, May 16, 2009

Using wget To Back Up a Blogger Blog

Update: After more twiddling it seems like the best incantation is

wget -rk -w 1 --random-wait -p -R '*\?*' -l 1
Specifying a recursion depth of 1 is sufficient to grab everything if you've got the archive widget on your page. More important, however, is the omission of the "-nc" switch. I'd assumed that this switch would prevent wget from downloading the same file multiple times in the same session. It does, but it also prevents wget from overwriting the file across multiple sessions, so you don't pick up any changes to the file. That, obviously, is undesirable behavior.
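
For repeated runs, the incantation drops neatly into a small wrapper script. The following is only a sketch: the blog URL is the one from the example below, and BACKUP_DIR is a placeholder to adjust as needed.

#!/bin/sh
# A minimal sketch of a repeatable backup run; BACKUP_DIR is a placeholder.
BLOG_URL="http://aleph-nought.blogspot.com"
BACKUP_DIR="$HOME/backups/blogger"

mkdir -p "$BACKUP_DIR"
cd "$BACKUP_DIR" || exit 1

# No -nc, so changed pages overwrite the stale local copies on each run.
wget -rk -w 1 --random-wait -p -R '*\?*' -l 1 "$BLOG_URL"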


Backing up a Blogger blog with wget would seem to be a fairly simple task, but when I tried the naive method:

wget -rk http://aleph-nought.blogspot.com

I ended up with a bloated directory structure containing a large number of pages with only minor variations. This seems to happen mostly on account of the archive widget; wget (understandably) doesn't grok that, regardless of where it's found, the archive widget points to the same content. After some mucking around I settled on the following, which seems to strike a good compromise between downloading redundant content and missing things entirely:

wget -rk -w 1 --random-wait -p -R '*\?*' -nc http://aleph-nought.blogspot.com

This works out as follows:

  • -r: Recursive fetch.
  • -k: Convert references for local viewing.
  • -w 1: For politeness, set wait time to 1 second between fetches.
  • --random-wait: Randomly vary the wait time; this helps keep wget from being blocked as a bot.
  • -p: Fetch all material needed to render a page.
  • -R '*\?*': This is the secret sauce; it prevents wget from saving anything whose URL contains a '?', which seems to eliminate most of the redundancy caused by the archive widget without sacrificing important content. Note that wget will still crawl these links; a quick way to check that none of them end up saved appears after this list.
  • -nc: Don't overwrite files which have already been downloaded (but see the update above for why this switch should be dropped).
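
As promised above, here's a quick way to confirm that the reject pattern did its job. This is just a sketch; it assumes you run it from wherever wget was invoked, and that wget created a directory named after the blog's hostname:

find aleph-nought.blogspot.com -name '*\?*'

If everything worked, this prints nothing; any output is a saved file whose URL contained a '?' and slipped past the reject pattern.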
