Migrating your blog posts to Markdown with Upmark and Nokogiri

February 3, 2012


This post was originally published in the Rambling Labs Blog on February 3, 2012.


As I said in my last post, for our new site, we changed our blog engine from WordPress to the Postmarkdown gem. At the end of that post, I mentioned that we had to migrate the old posts from WordPress to Markdown.

To do this, we built a Ruby script using the Upmark and Nokogiri gems. Nokogiri parses HTML and XML, among other things, while Upmark generates Markdown from HTML.

First, we exported our old blog posts from WordPress to an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your site. -->
<!-- ... -->
<rss version="2.0"
     xmlns:excerpt="http://wordpress.org/export/1.1/excerpt/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wfw="http://wellformedweb.org/CommentAPI/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.1/"
        >
  <channel>
    <title>Rambling Labs</title>
    <link>http://www.ramblinglabs.com</link>
    <description></description>
    <pubDate>Fri, 23 Dec 2011 18:49:41 +0000</pubDate>
    <language>en</language>
    <wp:wxr_version>1.1</wp:wxr_version>
    <wp:base_site_url>http://www.ramblinglabs.com</wp:base_site_url>
    <wp:base_blog_url>http://www.ramblinglabs.com</wp:base_blog_url>
<!-- ... -->
<!-- Several items in the following format -->
    <item>
      <title>The Name of the post</title>
      <link>http://www.ramblinglabs.com/2011/12/the-name-of-the-post/</link>
      <pubDate>Mon, 05 Dec 2011 19:30:17 +0000</pubDate>
      <dc:creator>the_creator</dc:creator>
      <guid isPermaLink="false">http://www.ramblinglabs.com/?p=8</guid>
      <description></description>
      <content:encoded><![CDATA[<!-- A lot of HTML -->]]></content:encoded>
      <!-- ... -->
    </item>
<!-- ... -->
    </channel>
</rss>

Then, in the script, we read the items with Nokogiri:

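require "nokogiri"

# Return the items from the block so they are still available after the file is closed.
items = File.open("export.xml") do |file|
  Nokogiri::XML(file).xpath("//channel//item")
end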

After that, we convert each item's HTML to Markdown with Upmark:

  # ...
  items.each do |item|
    content = Upmark.convert(item.at_xpath("content:encoded").text)
  end
  # ...

Finally, we write the files (in app/posts for Postmarkdown) with these lines, also inside the loop:

    date_str = item.at_xpath("wp:post_date_gmt").text + " +0000"
    name = item.at_xpath("wp:post_name").text.strip

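    # Time.parse comes from the "time" library, so the script needs `require "time"`.
    date = Time.parse(date_str).utc
    # Format only the date, so a "%" in the post name is never read as a strftime directive.
    filename = "#{date.strftime('%Y-%m-%d-%H%M%S')}-#{name}.markdown"
    path = File.join("app/posts", filename)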

    File.open(path, 'w') do |f|
      f.puts content
    end
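
The file naming is the easiest part to get subtly wrong, so here is a small sketch of it pulled out on its own. `post_filename` is a hypothetical helper, not something from our script or from Postmarkdown; it assumes the date arrives in the `YYYY-MM-DD HH:MM:SS` form that `wp:post_date_gmt` uses.

```ruby
require "time"

# Hypothetical helper (not part of the original script): builds the file name
# from the GMT date string and the post slug taken from the export.
def post_filename(post_date_gmt, slug)
  # The export gives the date without a zone, so mark it as UTC explicitly.
  date = Time.parse("#{post_date_gmt} +0000").utc
  # Only the date goes through strftime, so a "%" in the slug
  # is never read as a format directive.
  "#{date.strftime('%Y-%m-%d-%H%M%S')}-#{slug.strip}.markdown"
end

puts post_filename("2011-12-05 19:30:17", "the-name-of-the-post")
# prints "2011-12-05-193017-the-name-of-the-post.markdown"
```

Keeping the slug out of the format string matters mostly for posts with non-ASCII titles, where the slug can contain percent-encoded characters.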

Pretty cool, huh?!