Munging Posterous with CouchDB
Friday, 16 Dec 2011
Previously I used the Posterous API to retrieve all my blog posts. In this post, I’m going to show how easy it is to use CouchDB’s _bulk_docs API to get a lot of data in via a JSON import. Later on, I’ll transform it using CouchDB’s show, view and list functions.
CouchDB’s bulk loading API requires the JSON documents to be embedded in an array called docs, inside a parent JSON object that contains optional parameters telling CouchDB how to handle the upload. Note that the _id value must be a string.
{
  "all_or_nothing": true,
  "docs": [
    { "_id": "1", ... },
    { "_id": "2", ... }
  ]
}
I’m going to use the all_or_nothing model and keep munging my data until it works in one go. You could just as easily keep removing documents that were successfully uploaded from your parent JSON object, which might be a better approach if you had a lot of data.
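For the record, when all_or_nothing is left out, CouchDB reports on each document individually, so the response itself tells you which documents went in and which ones to keep for the retry. It looks roughly like this, with purely illustrative ids and revs:

[
  { "id": "1", "rev": "1-..." },
  { "id": "2", "error": "conflict", "reason": "Document update conflict." }
]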
The Posterous data I retrieved last time delivers my posts as a single JSON array. Here’s a trimmed sample of one post:
{
  "is_owned_by_current_user": true,
  "slug": "motorola-marketing-from-the-munchkin-wrangler",
  "body_excerpt": "motorola's cellphone naming looks to me like ...",
  "views_count": 77,
  "skip some text": "...",
  "comments_count": 0,
  "current_member": "admin",
  "id": 21298071,
  "replies_count": 0
}
That id field looks like an ideal choice to map to CouchDB’s _id field. It’s not a string though, so we’ll need to quote its value. As an old-school kinda guy, I did a perl one-liner. I am sure you node.js ninjas out there can do it in 20 lines of valid js with beautiful nested callbacks though.
The final step is to wrap the documents in the _bulk_docs format we saw initially - { "all_or_nothing": ..., "docs": [array] } - and then run it through jsonlint to confirm that Douglas Crockford is happy. No doubt it would look prettier in ruby.
# quote the numeric id value and rename the field to _id
$ perl -pi.dist -e 's/^(\s*)"id": (\d+)/$1"_id": "$2"/' posts.json
# wrap the array of posts in the _bulk_docs envelope
$ perl -e \
  'print qq({\n "all_or_nothing": true,\n "docs":\n);
   print <STDIN>;
   print qq(}\n);' < posts.json > bulk_docs.json
$ jsonlint bulk_docs.json
input text encoding is utf-8
So let’s push this into a new CouchDB and see what happens. In the worst case, our upload will be rejected in its entirety and we’ll simply need to re-try with improved data.
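One aside before the transcript: curly is just a thin wrapper around curl. Its exact definition isn’t important here; anything that adds the application/json Content-Type header (which a bare curl -d would not send) matches the request trace below. A minimal, assumed stand-in:

# assumption: a minimal stand-in for the curly wrapper used below,
# adding only the JSON Content-Type header seen in the request trace
curly() {
  curl -H "Content-Type: application/json" "$@"
}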
$ DB=http://admin:passwd@localhost:5984/posts
$ curly -X PUT $DB
{
"ok" : true
}
$ curly -vX POST $DB/_bulk_docs -d @bulk_docs.json
* About to connect() to localhost port 5984 (#0)
* Trying ::1... Connection refused
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 5984 (#0)
* Server auth using Basic with user 'admin'
> POST /posts/_bulk_docs HTTP/1.1
> Authorization: Basic d3ViOnd1Yg==
> User-Agent: curl/7.21.4 (universal-apple-darwin11.0) \
> libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:5984
> Accept: */*
> Content-Type: application/json
> Content-Length: 151723
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
} [data not shown]
< HTTP/1.1 201 Created
< Server: CouchDB/1.1.1 (Erlang OTP/R14B04)
< Date: Sat, 17 Dec 2011 00:08:09 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 1222
< Cache-Control: must-revalidate
<
{ [data not shown]
* Connection #0 to host localhost left intact
* Closing connection #0
[
  {
    "rev" : "1-fe25f5fa414e78b90b08959ef6763972",
    "id" : "66018817"
  },
  { "trimmed": "..." },
  {
    "rev" : "1-f4ee768657f44f87b6774489ceca042f",
    "id" : "21298071"
  }
]
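As a quick sanity check (this isn’t from the original run), the database info returned by a plain GET includes a doc_count, which should match the number of posts wrapped into bulk_docs.json:

$ curly $DB                      # GET /posts returns db info including doc_count
$ curly "$DB/_all_docs?limit=2"  # peek at the first couple of document ids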
In the subsequent post, I’ll transform it using CouchDB’s show, view and list functions to load it into Octopress.