8 May 2015

Finding blog posts with broken links

How often do you end up with broken HTML in your blog posts?  Not very often.  But if your blog does end up in such a state, how do you fix it?  I was playing around with my Picasa settings and, in the process, managed to break all my album URLs.  Every one of my albums now returned a 404.

I wanted to find all my blog posts that link to one of my Picasa Web albums and fix the broken links.  Unfortunately, Blogger's search only looks at the text of the posts, not the HTML.  This means that a post containing HTML like '<a href="http://picasaweb.google.com/mankis.pics/Pilgrimage2010">photos</a>' can be found by searching for "photos", but not by searching for "picasaweb" or "mankis.pics".  Like any self-respecting programmer, I decided I'd whip up a script that would do the searching for me.
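To see why a text-only search misses the link, here is a small sketch that strips the tags from the example above and checks what is left.  It only illustrates the idea using Python's standard html.parser module; I am not claiming this is how Blogger's search works internally, and the names TextOnly and visible_text are made up for the illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the visible text of an HTML fragment, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for attribute values.
        self.chunks.append(data)


def visible_text(html):
    """Return the text a reader would see, with all markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


post_html = ('Here are the <a href="http://picasaweb.google.com/'
             'mankis.pics/Pilgrimage2010">photos</a>')
text = visible_text(post_html)

print("photos" in text)          # the link text survives
print("picasaweb" in text)       # the URL lives in an attribute, so it is gone
print("picasaweb" in post_html)  # searching the raw HTML still finds it
```

The URL sits in the href attribute, which is not part of the visible text, so the only way to find it is to search the raw HTML.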

To access my posts from a script I had two choices: use Blogger's API, or get a dump of my blog by exporting it to a file.  The API would be a good choice if I had to edit a great number of posts, but I only had a few to fix, so I went ahead with the latter option.

Given the export XML file and a search string, my script prints the edit URLs of the posts that contain the string.  This is how I found all posts with links to a Picasa album:
% ./find_blog_posts.py \
     -b /tmp/blog-04-30-2011.xml \
     -s 'mankis.pics'
The script is available online if you want to use it.
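If you are curious how such a search might work, here is a minimal sketch.  It is not the actual find_blog_posts.py.  It assumes the export is an Atom feed in which each post is an entry carrying a category whose term ends in "kind#post", and that the post body is stored as escaped HTML in the content element.  It prints each matching post's public URL (the rel="alternate" link) rather than the edit URL, since the edit URL format is not something I want to guess at here.  The function names and the long option names are made up for this sketch.

```python
#!/usr/bin/env python3
"""Sketch: list posts in a Blogger export whose raw HTML contains a string."""
import argparse
import sys
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
# Assumption: posts are tagged with this category term in the export.
POST_KIND = "http://schemas.google.com/blogger/2008/kind#post"


def is_post(entry):
    """True if the entry is a blog post (not a comment, setting or template)."""
    return any(c.get("term") == POST_KIND
               for c in entry.findall(ATOM + "category"))


def find_posts(root, needle):
    """Yield (title, url) for every post whose raw HTML contains needle."""
    for entry in root.iter(ATOM + "entry"):
        if not is_post(entry):
            continue
        # ElementTree unescapes the content, so this is the post's raw HTML.
        content = entry.findtext(ATOM + "content") or ""
        if needle not in content:
            continue
        title = entry.findtext(ATOM + "title") or "(untitled)"
        url = None
        for link in entry.findall(ATOM + "link"):
            if link.get("rel") == "alternate":
                url = link.get("href")
        yield title, url


def main(argv=None):
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("-b", "--blog", required=True,
                        help="path to the exported blog XML file")
    parser.add_argument("-s", "--search", required=True,
                        help="string to look for in the posts' HTML")
    args = parser.parse_args(argv)

    root = ET.parse(args.blog).getroot()
    for title, url in find_posts(root, args.search):
        print(f"{url}\t{title}")


# Only run the command-line interface when arguments were actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

The match is a plain, case-sensitive substring test on the unescaped HTML, which is all that was needed to track down links containing 'mankis.pics'.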