Random Musings

O for a muse of fire, that would ascend the brightest heaven of invention!


Overview

Me

  • previously, HP EMEA Managed Storage Services Manager
    • about 7 Petabytes of storage, 160 customers
    • 20k+ servers backed up
    • 250 staff in a virtual organisation
  • these days, mainly an Apache CouchDB committer/developer
    • less stress
    • more fun
    • less money

Typical Day in the Office

  • how many of these logos can you name?
  • what languages are used?

Take-away

  • computer languages are a tool to get things done
  • your ability to get the job done depends on your own skill, and on the available libraries
  • most languages have libraries that are far better than your skillset

Quick overview of CouchDB + BigCouch

  • A distributable NoSQL database
  • Written in Erlang/OTP and JavaScript, with a bit of C
  • A document database management system speaking JSON over HTTP
  • Append-only MVCC storage

Views

Custom, persistent representations of your data (pre-built indexes)

  • Incremental MapReduce with results persisted to disk
  • Fast querying by primary key (views stored in a B-tree)
  • Bi-Directional Replication
  • Master-slave and multi-master topologies supported
  • Optional ‘filters’ to replicate a subset of the data to Edge devices (mobile phones, sensors, etc.)

More detail later.

Why Erlang?

  • Functional, ~ 20 years old, Prolog-like syntax (like grappa)
  • Very lightweight processes (~300 machine words each)
  • Message Passing only
  • No shared state, and (usually) no mutable state
  • Built-in functionality for distributed systems
  • Self-organising clusters to 50-70 nodes, more with care
  • Build “Apps” from groups of independent processes
  • Hot code loading = zero downtime upgrades

What does that really mean in practice?

Real Life CouchDB & Erlang

# start an erlang remote shell for introspection
erl -sname observer -hidden -run observer
# start a local couchdb instance
icouch

# pie is shorthand for the python-based httpie.org tool
which pie
 pie: aliased to http --verify=no --json --pretty all --verbose --style fruity
echo $COUCH
 http://localhost:5984
# clean up from talk testing
pie delete $COUCH/caffe ; pie delete $COUCH/latte
# make a new database
pie put $COUCH/caffe
# stream out any changes to the database over HTTP
# friendly to proxies, caches, works everywhere
pie get $COUCH/caffe/_changes\?feed=continuous\&heartbeat=30000 --stream

# add some documents using a nice shorthand JSON format 
pie put $COUCH/caffe/latte milk=true
pie put $COUCH/caffe/doppio espresso:='"double"'
pie put $COUCH/caffe/con-grappa grappa=moscato
# mmm we need a better grappa, this caffe is terrible!
pie put $COUCH/caffe/con-grappa grappa=primavera
# that fails with a 409 conflict; retry with the correct `_rev` (based on md5 of document)
pie put $COUCH/caffe/con-grappa\?rev=1-ea392027022c2e554d3d02e90d99aba5 \
    grappa=primavera

# make a new database
pie put $COUCH/latte
# start a replication
# look in observer
pie post $COUCH/_replicator _id=r2d2 source=caffe \
    target=latte continuous:=true
# look again & kill the process

This is quite unusual.

We can kill off arbitrary processes, and things just keep going. Minimal state is kept on disk, checkpointed periodically. Recovery is simple and fast.

I suggest this is a fundamental property of Erlang systems, and that the underlying features of the language make it extremely hard, arguably impossible, to achieve in any other language. Let’s take a look.

What happens if there’s …

mutable state within a process?

- code / logic can be harder to follow
- errors less evident
- single binding has its own traps wrt pattern matching, as the shell session below shows
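
For example, Erlang’s single assignment turns a would-be mutation into a pattern-match failure; a quick erl shell session:

%% rebinding is really pattern matching, so a “mutation” either
%% matches the existing value or fails loudly
1> X = 1.
1
2> X = 1.     % matches the existing binding, succeeds
1
3> X = 2.     % not an update: this is a failed match
** exception error: no match of right hand side value 2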

shared state between processes?

- can no longer debug a single function in isolation, as its state may be modified elsewhere
- require locks, semaphores, mutexes for everything
- concurrent programming just got very hard

Erlang Message Passing Semantics

Message Passing is the only way to communicate in distributed systems.

  • spawned processes always succeed
  • message sending always succeeds
  • receipt is not guaranteed
  • in-order receipt from process A -> B is guaranteed
  • only way to ensure delivery is to receive a reply
  • only way to ensure a reply was received is to … !@#!#
  • monitors, links, and named (local or global) processes help us out

Changes in local state must be bound within a single process or server, and mutated over time via messages that loop back through the same function. This forces us to write code that is both simpler to reason about and able to be distributed without major rewrites.
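
A minimal sketch of that pattern: a counter process whose only “mutable” state is the argument of its receive loop (module and function names are mine, not CouchDB’s):

-module(counter).
-export([start/0, increment/1, value/1]).

%% state lives only in the loop argument; it is “mutated” by
%% recursing with a new value after each message
start() -> spawn(fun() -> loop(0) end).

increment(Pid) -> Pid ! increment, ok.

value(Pid) ->
    Pid ! {value, self()},
    receive {counter, N} -> N end.

loop(Count) ->
    receive
        increment     -> loop(Count + 1);
        {value, From} -> From ! {counter, Count}, loop(Count)
    end.

Kill the process and a supervisor can simply restart it with a fresh, known state.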

######### COUCHDB IN DEPTH ############

Storage Layer

  • Durable append-only storage engine using B-trees and compressed entries (snappy)
  • Sequence tree enabling incremental processing of updates (time/history index)
  • Data structures supporting eventual consistency (each doc stored as a disjoint tree)
  • These primitives are exposed to build sophisticated replication & synchronization

B-Tree Structures on Disk

  • by_id: doc_id as key, including revision history for that doc
  • by_seqnum: history index, monotonically increasing. Provides _changes feed for replication, and view/index updates
  • root node always written last
  • header written twice for safety

Append Only

  • Each document update rewrites the path to the root in every index
  • Large sequential writes, smaller random reads
  • Wasted space must be reclaimed by periodic compaction
  • Disk is cheap (vs data loss)
  • SSD-friendly access pattern
  • This used to be controversial, now everyone does it (Postgres, SQLite)
  • Using Google Snappy (low-overhead in-memory compression, quite fast) per doc body

Sequence Index

  • Core of CouchDB’s replication super-powers
  • 1 opaque token per DB (and view)
  • a remote node can quickly catch up on docs updated ?since=....
  • not a time machine - updates are coalesced & revs may be pruned in compaction
  • allows incremental, on-demand processing in background
  • geospatial indexes, full-text search, sync to other systems

############ CLUSTERING #############

What we mean by Scaling

  • Horizontal scaling: more servers create more capacity
  • Transparent to the application: adding more capacity should not affect the business logic of the application. No single point of failure.

Eventual Consistency

“A shared-data system can have at most two of the three following properties: Consistency, Availability, and tolerance to network Partitions”

  • CAP theorem is often over-simplified

  • “You must choose P” – Adam Kocoloski, CTO & Founder, Cloudant

  • Partitions happen, esp in distributed systems

  • you must then pick either Consistency or Availability

  • distributed systems are generally more useful with Availability put first

  • the odds get worse with larger clusters:

P(at least one node fails) = 1 − (single-node reliability) ^ (number of nodes)

  • so failure is very likely at scale, e.g. roughly 1 full server recovery / week / 400 servers (see the arithmetic below)
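
A back-of-envelope check in the erl shell, with purely illustrative numbers (outputs rounded):

%% say each node survives a given week with probability 0.9975,
%% i.e. one failure per 400 node-weeks, in a 400-node cluster
1> 1 - math:pow(0.9975, 400).   % P(at least one node fails this week)
0.633
2> 400 * (1 - 0.9975).          % expected failures per week
1.0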

Funny Stories:

An electrician turned off the wrong circuit breaker while working, then flipped it back on very quickly, hoping nobody would notice. The resulting surge blew up 300 servers, taking major banks offline. Severe penalties for the customer. No injuries.

While building a new datacentre on the same site as an existing one, a digger driver severed both redundant network connections while digging cable trenches. Thousands of servers, storage, and networking isolated. A few months later the same issue happened again. And … again.

Servers in a primary data centre were shut down because the temperature inside was too high. Lack of capacity planning was the root cause, driven by zealous cost saving measures. The data centre was due to be closed in a few years but the replacement data centre was not built.

Further Reading

[dbmsmusings:] http://dbmsmusings.blogspot.co.uk/2010/04/problems-with-cap-and-yahoos-little.html

Reliable Storage with Unreliable Systems

Dynamo paper in a nutshell:

  • Use a ring topology to represent a cluster providing a simple service (sketched after this list)
  • Over-shard the ring — each physical node covers many virtual nodes
  • Nodes gossip with random neighbours, sharing join/leave messages and fact-based availability (e.g. B didn’t respond to A in time)
  • Provide a load-balancer service over top of the ring
  • Each object (document) is stored as a blob + vector clock
  • The vector clock is used to reconcile cluster-internal conflicts
  • A sloppy quorum is used to accept writes, and to validate outgoing reads. These choices are exposed to the application, per object
  • Merkle trees are used to ensure internal replica consistency
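
A toy version of the ring placement, assuming erlang:phash2/2 as the hash function (real Dynamo-style systems add virtual nodes and replicate each object to the next few positions around the ring):

-module(ring).
-export([owner/2]).

%% hash the key and the nodes onto the same ring; the first node
%% clockwise from the key’s position owns it (no virtual nodes here)
owner(Key, Nodes) ->
    KeyPos = erlang:phash2(Key, 1 bsl 27),
    Ring = lists:sort([{erlang:phash2(N, 1 bsl 27), N} || N <- Nodes]),
    case [Node || {Pos, Node} <- Ring, Pos >= KeyPos] of
        [Owner | _] -> Owner;
        []          -> element(2, hd(Ring))   % wrapped past the end
    end.

ring:owner(<<"con-grappa">>, [couch1, couch2, couch3]) then names the (hypothetical) node responsible for that document.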

########## REPLICATION ############

CouchDB’s super power

  • Combining revision trees with append-only storage means no data is ever lost, even during network partitions

  • But the application must be able to reconcile these eventual conflicts

  • Exposes the revision tree as a nested JSON structure per doc

  • A simple loop over the changes feed to reach all documents (sketched after this list)

  • A periodic checkpoint to make restarting less painful

  • A lot of this was developed without suitable academic papers backing it
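
In outline the loop looks something like this sketch, where fetch_changes/2, apply_missing_revs/2 and checkpoint/2 are hypothetical helpers, not couch_replicator’s real API:

%% pull changes since the last known sequence, push whatever the
%% target is missing, checkpoint so a restart resumes from here
replicate(Source, Target, Since) ->
    {Changes, NewSince} = fetch_changes(Source, Since),
    ok = apply_missing_revs(Target, Changes),
    ok = checkpoint(Target, NewSince),
    replicate(Source, Target, NewSince).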

Today we’d likely use:

  • Merkle hash trees (fast sync for previously unknown replicas)
  • Hash histories for conflict reconciliation
  • or possibly newer CRDTs (Conflict-free Replicated Data Types), e.g. eventually consistent distributed counters

Conflict Reconciliation

  • A home-grown CRDT
  • Like git, using hashes of content
  • Includes a linear history for each branch
  • May end up with disjoint sets
  • On compaction, only leaf nodes remain
  • But internal history hashes are preserved
  • Bloats over time

Not Your Average Sync

  • Linking a SQL DB (e.g. Postgres) with Couch is possible but difficult; the two are conceptually very different
  • Transfers updates from any source to any target
  • Builds on earlier primitives
  • sequence_id to determine what docs have changed
  • hash histories to find missing/changed revisions per doc
  • Critical anti-entropy element in clusters
  • BigCouch uses the same approach internally
  • DBs are split into partitions, sharded redundantly across servers
  • Partition copies replicate internally, ensuring durability and … eventually … consistency

################ VIEWS ######################

Map-Reduce (Google, 2004)

  • Input & Output: each a set of key/value pairs

  • Programmer specifies two functions:

    map (in_key, in_value) -> list(out_key, intermediate_value)

    Processes an input key/value pair; produces a set of intermediate pairs

    reduce (out_key, list(intermediate_value)) -> list(out_value)

    Combines all intermediate values for a particular key; produces a set of merged output values (usually just one)

  • Inspired by similar primitives in LISP and other languages; a toy Erlang version follows
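
The same contract in plain Erlang, counting caffe documents per type (a toy in-memory sketch, not CouchDB’s actual view engine):

%% toy map/reduce over an in-memory list of {Id, Doc} pairs
1> Map = fun({_Id, Doc}) -> [{maps:get(type, Doc), 1}] end,
   Reduce = fun(_Key, Values) -> lists:sum(Values) end,
   Docs = [{1, #{type => latte}}, {2, #{type => doppio}},
           {3, #{type => latte}}],
   Pairs = lists:append([Map(D) || D <- Docs]),
   Keys = lists:usort([K || {K, _} <- Pairs]),
   [{K, Reduce(K, [V || {K2, V} <- Pairs, K2 =:= K])} || K <- Keys].
[{doppio,1},{latte,2}]

CouchDB persists these results in a view B-tree and recomputes only what changed, hence “incremental”.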

Map-Reduce Views

  • CPU & Data are co-located
  • Independent pipelines
  • Suited for distributed computing over large sets
  • In (Big)Couch, this is effectively a mergesort
  • In CouchDB, maps are built serially with multiple map functions combined into a single pass
  • Written to disk serially because it is much simpler

########### ODDS AND ENDS ##############

Internal Data Structures

  • Use tuples {thing, other_thing, more_things} — fast
  • Lists used for B-tree processing
  • Sets & dictionaries used for merging views
  • Records (like C structs) used for key data structures like headers, view internals & B-tree nodes (see the sketch below)
  • Ongoing introduction of “success typing” (Dialyzer), i.e. static analysis for a dynamic language
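
For example, a record is compile-time sugar over a tagged tuple, much like a C struct (this #node{} definition is illustrative, not CouchDB’s actual B-tree record):

%% #node{key = K} compiles down to the tuple {node, K, [], undefined}
-record(node, {key, children = [], reduction}).

depth(#node{children = []})   -> 1;
depth(#node{children = Kids}) -> 1 + lists:max([depth(K) || K <- Kids]).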

Evils and Pros of HTTP & JSON

  • HTTP is a horrible set of protocols to parse, with variable client support
  • JSON parsing as a practical problem - streaming vs fully validated data
  • Binary efficiency is often desirable (& why)
  • But HTTP compatibility brings fast scaling via HTTP-based etag caching & general web friendliness
  • Easy to confuse raw throughput in benchmarks with real-world low-latency performance under load

Pragmatic Polyglot Programming

  • mixing erlang, C, javascript for fun & profit
  • let each do what it does best:
  • C for speed, & memory efficiency
  • erlang for distributed systems & low latency
  • javascript for web & JSON friendliness
  • in time, maintainability is more important than anything else
  • design for failure, & program the best case
  • make it work, make it pretty, make it fast

People & Diversity

  • what I used to do (operations, manager, outsourcing), & what I now do (programmer, startup)

  • more career options than you might think

  • always have a plan - 3 of them (Chairman Mao’s 5 year plans)

  • intro to Apache Software Foundation (I’m a member) & open source darwinism

  • you can’t fix everything, stay focused

  • community (& publicity) wins over code

  • why licences matter - freedom & pragmatism

  • just aim 1 cm higher every day

  • work & real life (balance, programming, people, health etc)

  • people skills most important in the long term

  • empathy, stress & burnout, privilege & gender awareness

  • be a specialist and a generalist

  • always water & grow your network of people and your knowledge

  • iterate on habits that last a life time (TWBASI)

  • just aim 1 cm higher every day