Migrating your blog posts to Markdown with Upmark and Nokogiri

February 3, 2012

This post was originally published in the Rambling Labs Blog on February 3, 2012.

As I said in my last post, for our new site, we changed our blog engine from WordPress to the Postmarkdown gem. At the end of that post, I mentioned that we had to migrate the old posts from WordPress to Markdown.

To do this, we built a Ruby script using the Upmark gem and the Nokogiri gem. Nokogiri parses HTML and XML, among other things, while Upmark generates Markdown from a given HTML string.

First, we exported our old blog posts from WordPress to an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your site. -->
<!-- ... -->
<rss version="2.0"> <!-- namespace declarations omitted -->
  <channel>
    <title>Rambling Labs</title>
    <pubDate>Fri, 23 Dec 2011 18:49:41 +0000</pubDate>
    <!-- ... -->
    <!-- Several items in the following format -->
    <item>
      <title>The Name of the post</title>
      <pubDate>Mon, 05 Dec 2011 19:30:17 +0000</pubDate>
      <guid isPermaLink="false">http://www.ramblinglabs.com/?p=8</guid>
      <content:encoded><![CDATA[<!-- A lot of HTML -->]]></content:encoded>
      <!-- ... -->
    </item>
    <!-- ... -->
  </channel>
</rss>

Then, in the script, we read the items with Nokogiri:

require "nokogiri"
require "upmark"
File.open("export.xml") do |file|
  items = Nokogiri::XML(file).xpath("//channel//item")

After that, we convert each item's HTML to Markdown with Upmark:

  # ...
  items.each do |item|
    content = Upmark.convert(item.at_xpath("content:encoded").text)
  # ...

Finally, we write the appropriate files (in app/posts for Postmarkdown) with these lines, which also go inside the loop. Time.parse comes from Ruby's time library, so the script needs require "time" at the top:

    date_str = item.at_xpath("wp:post_date_gmt").text + " +0000"
    name = item.at_xpath("wp:post_name").text.strip

    date = Time.parse(date_str).utc
    # Format only the date: a "%" in the post name would otherwise be read as a strftime directive
    filename = "#{date.strftime('%Y-%m-%d-%H%M%S')}-#{name}.markdown"
    path = File.join("app/posts", filename)

    File.open(path, "w") do |f|
      f.puts content
    end
  end
end
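
The file naming step can be checked on its own, without the XML or any gems. The sketch below is a standalone illustration, not part of the original script: `post_filename` is a hypothetical helper, and the sample date assumes the `YYYY-MM-DD HH:MM:SS` shape that wp:post_date_gmt usually has.

```ruby
require "time" # provides Time.parse

# Hypothetical helper: builds the "date-slug.markdown" name used for each post.
# date_gmt is assumed to look like "2011-12-05 19:30:17" (GMT, no zone suffix).
def post_filename(date_gmt, slug)
  date = Time.parse("#{date_gmt} +0000").utc
  # The slug is appended after formatting, so a "%" inside it is kept literally.
  "#{date.strftime('%Y-%m-%d-%H%M%S')}-#{slug}.markdown"
end

puts post_filename("2011-12-05 19:30:17", "the-name-of-the-post")
# prints "2011-12-05-193017-the-name-of-the-post.markdown"

puts post_filename("2011-12-05 19:30:17", "caf%c3%a9-notes")
# prints "2011-12-05-193017-caf%c3%a9-notes.markdown"
```

Keeping the slug out of the strftime pattern matters for posts whose names contain percent-encoded characters, since sequences like `%c` are valid format directives.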

Pretty cool, huh?!