Random Musings

O for a muse of fire, that would ascend the brightest heaven of invention!


Overview

Me

  • previously, HP EMEA Managed Storage Services Manager
    • about 7 Petabytes of storage, 160 customers
    • 20k+ servers backed up
    • 250 staff in a virtual organisation
  • these days, mainly an Apache CouchDB committer/developer
    • less stress
    • more fun
    • less money

Typical Day in the Office

  • how many of these logos can you name?
  • what languages are used?

Take-away

  • computer languages are a tool to get things done
  • your ability to get the job done depends on your own skill, and on the available libraries
  • most languages have libraries that are far better than your skillset

Quick overview of CouchDB + BigCouch

  • A distributable NoSQL database
  • Written in Erlang/OTP and JavaScript, with a bit of C
  • A document database management system speaking JSON over HTTP
  • Append-only MVCC storage

Views

Custom, persistent representations of your data (pre-built indexes)

  • Incremental MapReduce with results persisted to disk
  • Fast querying by primary key (views stored in a B-tree)
  • Bi-Directional Replication
  • Master-slave and multi-master topologies supported
  • Optional ‘filters’ to replicate a subset of the data to Edge devices (mobile phones, sensors, etc.)

More detail later.

Why Erlang?

  • Functional, ~ 20 years old, Prolog-like syntax (like grappa)
  • Very lightweight processes (~300 machine words each)
  • Message Passing only
  • No shared state, and (usually) no mutable state
  • Built-in functionality for distributed systems
  • Self-organising clusters to 50-70 nodes, more with care
  • Build “Apps” from groups of independent processes
  • Hot code loading = zero downtime upgrades

What does that really mean in practice?

Real Life CouchDB & Erlang

# start an erlang remote shell for introspection
erl -sname observer -hidden -run observer
# start a local couchdb instance
icouch

# pie is shorthand for the python-based httpie.org tool
which pie
 pie: aliased to http --verify=no --json --pretty all --verbose --style fruity
echo $COUCH
 http://localhost:5984
# clean up from talk testing
pie delete $COUCH/caffe ; pie delete $COUCH/latte
# make a new database
pie put $COUCH/caffe
# stream out any changes to the database over HTTP
# friendly to proxies, caches, works everywhere
pie get $COUCH/caffe/_changes\?feed=continuous\&heartbeat=30000 --stream

# add some documents using a nice shorthand JSON format 
pie put $COUCH/caffe/latte milk=true
pie put $COUCH/caffe/doppio espresso:='"double"'
pie put $COUCH/caffe/con-grappa grappa=moscato
# mmm we need a better grappa, this caffe is terrible!
pie put $COUCH/caffe/con-grappa grappa=primavera
# that fails with a 409 conflict; retry with the correct `_rev` (based on md5 of document)
pie put $COUCH/caffe/con-grappa\?rev=1-ea392027022c2e554d3d02e90d99aba5 \
    grappa=primavera

# make a new database
pie put $COUCH/latte
# start a replication
# look in observer
pie post $COUCH/_replicator _id=r2d2 source=caffe \
    target=latte continuous:=true
# look again & kill the process

This is quite unusual.

We can kill off arbitrary processes, and things just keep going. Minimal state is kept on disk, checkpointed periodically. Recovery is simple and fast.

I suggest this is a fundamental property of Erlang systems, and that the underlying features of the language make it extremely hard, arguably impossible, to achieve in any other language. Let’s take a look.

What happens if there’s …

mutable state within a process?

- code / logic can be harder to follow
- errors less evident
- single binding has its own traps wrt pattern matching, as the shell session below shows
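
For example, Erlang’s single assignment turns a would-be mutation into a pattern-match failure; a quick erl shell session:

%% rebinding is really pattern matching, so a “mutation” either
%% matches the existing value or fails loudly
1> X = 1.
1
2> X = 1.     % matches the existing binding, succeeds
1
3> X = 2.     % not an update: this is a failed match
** exception error: no match of right hand side value 2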

shared state between processes?

- can no longer debug a single function in isolation, as its state may be modified elsewhere
- require locks, semaphores, mutexes for everything
- concurrent programming just got very hard

Erlang Message Passing Semantics

Message Passing is the only way to communicate in distributed systems.

  • spawned processes always succeed
  • message sending always succeeds
  • receipt is not guaranteed
  • in-order receipt from process A -> B is guaranteed
  • only way to ensure delivery is to receive a reply
  • only way to ensure a reply was received is to … !@#!#
  • monitors, links, and named (local or global) processes help us out

Changes in local state must be bound within a single process or server, and mutated over time via messages that loop back through the same function. This forces us to write code that is both simpler to reason about and able to be distributed without major rewrites.
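
A minimal sketch of that pattern: a counter process whose only “mutable” state is the argument of its receive loop (module and function names are mine, not CouchDB’s):

-module(counter).
-export([start/0, increment/1, value/1]).

%% state lives only in the loop argument; it is “mutated” by
%% recursing with a new value after each message
start() -> spawn(fun() -> loop(0) end).

increment(Pid) -> Pid ! increment, ok.

value(Pid) ->
    Pid ! {value, self()},
    receive {counter, N} -> N end.

loop(Count) ->
    receive
        increment     -> loop(Count + 1);
        {value, From} -> From ! {counter, Count}, loop(Count)
    end.

Kill the process and a supervisor can simply restart it with a fresh, known state.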

######### COUCHDB IN DEPTH ############

Storage Layer

  • Durable append-only storage engine using B-trees and compressed entries (snappy)
  • Sequence tree enabling incremental processing of updates (time/history index)
  • Data structures supporting eventual consistency (each doc stored as a disjoint tree)
  • These primitives are exposed to build sophisticated replication & synchronization

B-Tree Structures on Disk

  • by_id: doc_id as key, including revision history for that doc
  • by_seqnum: history index, monotonically increasing. Provides _changes feed for replication, and view/index updates
  • root node always written last
  • header written twice for safety

Append Only

  • Each document update rewrites the path to the root in every index
  • Large sequential writes, smaller random reads
  • Wasted space must be reclaimed by periodic compaction
  • Disk is cheap (vs data loss)
  • SSD-friendly access pattern
  • This used to be controversial, now everyone does it (Postgres, SQLite)
  • Using Google Snappy (low-overhead in-memory compression, quite fast) per doc body

Sequence Index

  • Core of CouchDB’s replication super-powers
  • 1 opaque token per DB (and view)
  • a remote node can quickly catch up on docs updated ?since=....
  • not a time machine - updates are coalesced & revs may be pruned in compaction
  • allows incremental, on-demand processing in background
  • geospatial indexes, full-text search, sync to other systems

############ CLUSTERING #############

What we mean by Scaling

  • Horizontal scaling: more servers create more capacity
  • Transparent to the application: adding more capacity should not affect the business logic of the application. No single point of failure.

Eventual Consistency

“A shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions”

  • CAP theorem is often over-simplified

  • “You must choose P” – Adam Kocoloski, CTO & Founder, Cloudant

  • Partitions happen, esp in distributed systems

  • you must then pick either Consistency or Availability

  • distributed systems are generally more useful with Availability put first

  • the odds get worse with larger clusters:

P(at least one node fails) = 1 − (single-node reliability) ^ (number of nodes)

  • so failure is very likely at scale, e.g. roughly 1 full server recovery / week / 400 servers (see the arithmetic below)
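
A back-of-envelope check in the erl shell, with purely illustrative numbers (outputs rounded):

%% say each node survives a given week with probability 0.9975,
%% i.e. one failure per 400 node-weeks, in a 400-node cluster
1> 1 - math:pow(0.9975, 400).   % P(at least one node fails this week)
0.633
2> 400 * (1 - 0.9975).          % expected failures per week
1.0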

Funny Stories:

An electrician turned off the wrong circuit breaker while working, then flipped it back on very quickly, hoping nobody would notice. The resulting surge blew up 300 servers, taking major banks offline. Severe penalties for the customer. No injuries.

While building a new datacentre on the same site as an existing one, a digger driver severed both redundant network connections while digging cable trenches. Thousands of servers, storage, and networking isolated. A few months later the same issue happened again. And … again.

Servers in a primary data centre were shut down because the temperature inside was too high. Lack of capacity planning was the root cause, driven by zealous cost saving measures. The data centre was due to be closed in a few years but the replacement data centre was not built.

Further Reading

[dbmsmusings:] http://dbmsmusings.blogspot.co.uk/2010/04/problems-with-cap-and-yahoos-little.html

Reliable Storage with Unreliable Systems

Dynamo paper in a nutshell:

  • Use a ring topology to represent a cluster providing a simple service (sketched after this list)
  • Over-shard the ring — each physical node covers many virtual nodes
  • Nodes gossip with random neighbours, sharing join/leave messages and fact-based availability (e.g. B didn’t respond to A in time)
  • Provide a load-balancer service over top of the ring
  • Each object (document) is stored as a blob + vector clock
  • The vector clock is used to reconcile cluster-internal conflicts
  • A sloppy quorum is used to accept writes, and to validate outgoing reads. These choices are exposed to the application, per object
  • Merkle trees are used to ensure internal replica consistency
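
A toy version of the ring placement, assuming erlang:phash2/2 as the hash function (real Dynamo-style systems add virtual nodes and replicate each object to the next few positions around the ring):

-module(ring).
-export([owner/2]).

%% hash the key and the nodes onto the same ring; the first node
%% clockwise from the key’s position owns it (no virtual nodes here)
owner(Key, Nodes) ->
    KeyPos = erlang:phash2(Key, 1 bsl 27),
    Ring = lists:sort([{erlang:phash2(N, 1 bsl 27), N} || N <- Nodes]),
    case [Node || {Pos, Node} <- Ring, Pos >= KeyPos] of
        [Owner | _] -> Owner;
        []          -> element(2, hd(Ring))   % wrapped past the end
    end.

ring:owner(<<"con-grappa">>, [couch1, couch2, couch3]) then names the (hypothetical) node responsible for that document.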

########## REPLICATION ############

CouchDB’s super power

  • Combining revision trees with append-only storage means no data is ever lost, even during network partitions

  • But the application must be able to reconcile these eventual conflicts

  • Exposes the revision tree as a nested JSON structure per doc

  • A simple loop over the changes feed to reach all documents (sketched after this list)

  • A periodic checkpoint to make restarting less painful

  • A lot of this was developed without suitable academic papers backing it
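
In outline the loop looks something like this sketch, where fetch_changes/2, apply_missing_revs/2 and checkpoint/2 are hypothetical helpers, not couch_replicator’s real API:

%% pull changes since the last known sequence, push whatever the
%% target is missing, checkpoint so a restart resumes from here
replicate(Source, Target, Since) ->
    {Changes, NewSince} = fetch_changes(Source, Since),
    ok = apply_missing_revs(Target, Changes),
    ok = checkpoint(Target, NewSince),
    replicate(Source, Target, NewSince).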

Today we’d likely use:

  • Merkle hash trees (fast sync for previously unknown replicas)
  • Hash histories for conflict reconciliation
  • or possibly newer CRDTs (Conflict-free Replicated Data Types), e.g. eventually consistent distributed counters

Conflict Reconciliation

  • A home-grown CRDT
  • Like git, using hashes of content
  • Includes a linear history for each branch
  • May end up with disjoint sets
  • On compaction, only leaf nodes remain
  • But internal history hashes are preserved
  • Bloats over time

Not Your Average Sync

  • Linking a SQL DB (e.g. Postgres) with Couch is possible but difficult; the two are conceptually very different
  • Transfers updates from any source to any target
  • Builds on earlier primitives
  • sequence_id to determine what docs have changed
  • hash histories to find missing/changed revisions per doc
  • Critical anti-entropy element in clusters
  • BigCouch uses the same approach internally
  • DBs are split into partitions, sharded redundantly across servers
  • Partition copies replicate internally, ensuring durability and … eventually … consistency

################ VIEWS ######################

Map-Reduce (Google, 2004)

  • Input & Output: each a set of key/value pairs

  • Programmer specifies two functions:

    map (in_key, in_value) -> list(out_key, intermediate_value)

    Processes an input key/value pair; produces a set of intermediate pairs

    reduce (out_key, list(intermediate_value)) -> list(out_value)

    Combines all intermediate values for a particular key; produces a set of merged output values (usually just one)

  • Inspired by similar primitives in LISP and other languages; a toy Erlang version follows
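
The same contract in plain Erlang, counting caffe documents per type (a toy in-memory sketch, not CouchDB’s actual view engine):

%% toy map/reduce over an in-memory list of {Id, Doc} pairs
1> Map = fun({_Id, Doc}) -> [{maps:get(type, Doc), 1}] end,
   Reduce = fun(_Key, Values) -> lists:sum(Values) end,
   Docs = [{1, #{type => latte}}, {2, #{type => doppio}},
           {3, #{type => latte}}],
   Pairs = lists:append([Map(D) || D <- Docs]),
   Keys = lists:usort([K || {K, _} <- Pairs]),
   [{K, Reduce(K, [V || {K2, V} <- Pairs, K2 =:= K])} || K <- Keys].
[{doppio,1},{latte,2}]

CouchDB persists these results in a view B-tree and recomputes only what changed, hence “incremental”.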

Map-Reduce Views

  • CPU & Data are co-located
  • Independent pipelines
  • Suited for distributed computing over large sets
  • In (Big)Couch, this is effectively a mergesort
  • In CouchDB, maps are built serially with multiple map functions combined into a single pass
  • Written to disk serially because it is much simpler

########### ODDS AND ENDS ##############

Internal Data Structures

  • Use tuples {thing, other_thing, more_things} — fast
  • Lists used for B-tree processing
  • Sets & dictionaries used for merging views
  • Records (like C structs) used for key data structures like headers, view internals & B-tree nodes (see the sketch below)
  • Ongoing introduction of “success typing” (Dialyzer), i.e. static analysis for a dynamic language
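
For example, a record is compile-time sugar over a tagged tuple, much like a C struct (this #node{} definition is illustrative, not CouchDB’s actual B-tree record):

%% #node{key = K} compiles down to the tuple {node, K, [], undefined}
-record(node, {key, children = [], reduction}).

depth(#node{children = []})   -> 1;
depth(#node{children = Kids}) -> 1 + lists:max([depth(K) || K <- Kids]).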

Evils and Pros of HTTP & JSON

  • HTTP is a horrible set of protocols to parse, with variable client support
  • JSON parsing as a practical problem - streaming vs fully validated data
  • Binary efficiency is often desirable (& why)
  • But HTTP compatibility brings fast scaling via HTTP-based etag caching & general web friendliness
  • Easy to confuse raw throughput in benchmarks with real-world low-latency performance under load

Pragmatic Polyglot Programming

  • mixing erlang, C, javascript for fun & profit
  • let each do what it does best:
  • C for speed, & memory efficiency
  • erlang for distributed systems & low latency
  • javascript for web & JSON friendliness
  • in time, maintainability is more important than anything else
  • design for failure, & program the best case
  • make it work, make it pretty, make it fast

People & Diversity

  • what I used to do (operations, manager, outsourcing), & what I now do (programmer, startup)

  • more career options than you might think

  • always have a plan - 3 of them (Chairman Mao’s 5 year plans)

  • intro to Apache Software Foundation (I’m a member) & open source darwinism

  • you can’t fix everything, stay focused

  • community (& publicity) wins over code

  • why licences matter - freedom & pragmatism

  • just aim 1 cm higher every day

  • work & real life (balance, programming, people, health etc)

  • people skills most important in the long term

  • empathy, stress & burnout, privilege & gender awareness

  • be a specialist and a generalist

  • always water & grow your network of people and your knowledge

  • iterate on habits that last a life time (TWBASI)

  • just aim 1 cm higher every day