Random Musings

O for a muse of fire, that would ascend the brightest heaven of invention!


Munging Posterous with CouchDB

Friday, 16 Dec 2011 Tags: couchdb, migration, posterous

Previously, I used the Posterous API to retrieve all my blog posts. In this post, I'm going to show how easy it is to use CouchDB's _bulk_docs API to import lots of data as JSON. Later on, I'll transform it using CouchDB's show, view and list functions.

CouchDB’s bulk loading API requires JSON documents to be embedded in an array called docs within a parent JSON object that contains optional parameters to indicate to CouchDB how to handle the upload. Note that the _id value must be a string.

{
  "all_or_nothing": true,
  "docs": [
    {"_id": "1", ... },
    {"_id": "2", ... }
  ]
}

I'm going to use the all_or_nothing option and keep munging my data until it all loads in one go. You could just as easily keep removing documents that were successfully uploaded from your parent JSON object and re-posting the rest, which might be a better approach if you have a lot of data.
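That remove-the-successes approach can be sketched in a few lines of Python. This is a hypothetical helper, not part of my actual workflow: without all_or_nothing, the _bulk_docs response contains one result object per document, and accepted documents carry a "rev" while rejected ones carry an "error" instead.

import json

def remaining_docs(payload, response):
    # Accepted docs carry a "rev" in the _bulk_docs response;
    # keep only the docs that were not accepted, for the next attempt.
    succeeded = {r["id"] for r in response if "rev" in r}
    return [d for d in payload["docs"] if d["_id"] not in succeeded]

payload = {"docs": [{"_id": "1"}, {"_id": "2"}]}
response = [
    {"id": "1", "rev": "1-abc"},              # accepted
    {"id": "2", "error": "conflict"},         # rejected, retry this one
]
print(remaining_docs(payload, response))      # [{'_id': '2'}]

Each retry shrinks the payload until only genuinely problematic documents remain.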

The Posterous data I retrieved last time delivers my posts as a single JSON array. Here’s a trimmed sample:

{
    "is_owned_by_current_user": true,
    "slug": "motorola-marketing-from-the-munchkin-wrangler",
    "body_excerpt": "motorola's cellphone naming looks to me like ...",
    "views_count": 77,
    "skip some text" : "...",
    "comments_count": 0,
    "current_member": "admin",
    "id": 21298071,
    "replies_count": 0
}

That id field looks like an ideal choice to map to CouchDB's _id field. It's not a string though, so we'll need to quote its value and rename the key. As an old-school kinda guy, I did it with a perl one-liner. I'm sure you node.js ninjas out there could do it in 20 lines of valid js with beautiful nested callbacks though.

The final step is to wrap the documents in the _bulk_docs format we saw initially - { "all_or_nothing": ..., "docs": [array] } - and then run it through jsonlint to confirm that Douglas Crockford is happy. No doubt it would look prettier in ruby.

$ perl -pi.dist -e 's/^(\s+)"id": (\d+)/$1"_id": "$2"/' posts.json
$ perl -e \
  'print qq({\n    "all_or_nothing": true,\n    "docs":\n);
   print <STDIN>;
   print qq(}\n);' < posts.json > bulk_docs.json
$ jsonlint bulk_docs.json
input text encoding is utf-8
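If regexes on JSON make you nervous, the same two steps can be done with a real JSON parser. Here's a rough Python equivalent (a sketch, not the tooling I actually used): convert each numeric Posterous "id" into a string "_id", then wrap the array in the _bulk_docs envelope.

import json

def to_bulk_docs(posts):
    # Convert each numeric Posterous "id" to a string CouchDB "_id"
    # and wrap the lot in the _bulk_docs envelope shown earlier.
    docs = []
    for post in posts:
        doc = dict(post)
        doc["_id"] = str(doc.pop("id"))  # CouchDB requires a string _id
        docs.append(doc)
    return {"all_or_nothing": True, "docs": docs}

payload = to_bulk_docs([{"id": 21298071, "views_count": 77}])
print(json.dumps(payload, indent=2))

Parsing and re-serialising also gives you free validation, so the jsonlint step becomes a belt-and-braces check rather than a necessity.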

So let’s push this into a new CouchDB and see what happens. In the worst case, our upload will be rejected in its entirety and we’ll simply need to re-try with improved data.

$ DB=http://admin:passwd@localhost:5984/posts
$ curly -X PUT $DB
{
   "ok" : true
}
$ curly -vX POST $DB/_bulk_docs -d @bulk_docs.json
* About to connect() to localhost port 5984 (#0)
*   Trying ::1... Connection refused
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 5984 (#0)
* Server auth using Basic with user 'admin'
> POST /posts/_bulk_docs HTTP/1.1
> Authorization: Basic d3ViOnd1Yg==
> User-Agent: curl/7.21.4 (universal-apple-darwin11.0) \
>    libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:5984
> Accept: */*
> Content-Type: application/json
> Content-Length: 151723
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [data not shown]
< HTTP/1.1 201 Created
< Server: CouchDB/1.1.1 (Erlang OTP/R14B04)
< Date: Sat, 17 Dec 2011 00:08:09 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 1222
< Cache-Control: must-revalidate
<
{ [data not shown]
* Connection #0 to host localhost left intact
* Closing connection #0
[
   {
      "rev" : "1-fe25f5fa414e78b90b08959ef6763972",
      "id" : "66018817"
   },
   { "trimmed": "..." },
   {
      "rev" : "1-f4ee768657f44f87b6774489ceca042f",
      "id" : "21298071"
   }
]
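The response array has one entry per uploaded document, so a quick way to confirm nothing was rejected is to scan it for "error" keys. A small sketch in Python, run against the (trimmed) response above:

import json

# Parse the _bulk_docs response; accepted docs carry a "rev",
# rejected ones an "error" key instead.
response = json.loads('''[
  {"rev": "1-fe25f5fa414e78b90b08959ef6763972", "id": "66018817"},
  {"rev": "1-f4ee768657f44f87b6774489ceca042f", "id": "21298071"}
]''')

errors = [r for r in response if "error" in r]
print("%d docs, %d errors" % (len(response), len(errors)))  # 2 docs, 0 errors

All revs start with 1- as expected for freshly created documents, and there are no errors, so the upload succeeded in full.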

In the next post, I'll transform these posts using CouchDB's show, view and list functions, ready to load into Octopress.