We read Dynamo today, and this is a summary of my own key takeaways. Amazon's Dynamo DB is interesting because it's a highly available and incrementally scaleable key-value store, optimized for Amazon's shopping cart, which requires a 99.9% write availability SLA. In particular, it allows service owners to customize their durability, consistency, and availabilities by configuring their read / write replication factors. Key ideas used and addressed in this paper include:
- Consistent hashing with virtual nodes
- Partition schemes for incremental scaleability
- Merkle Trees for conflict detection
- Sloppy quorum and hinted handoff for dealing with failures
- Gossip protocol for detecting membership changes
- Vector clocks for data versioning
Consistent Hashing with virtual nodes
- Consistent hashing
- Each node in the system is assigned a random value within this space which represents its “position” on the ring.
- Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.
- Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring.
- The principle advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.
- Problems
- Random position assignment of each node on the ring leads to non-uniform data and load distribution.
- Every time a new node comes online or offline, you have to rehash everything and recompute all Merkle Trees.
- Algorithm is oblivious to the heterogeneity in the performance of nodes.
- Virtual nodes as a solution
- Instead of mapping a node to a single point in the circle, each node gets assigned to multiple “virtual nodes” in the ring.
- The number of virtual nodes that a node is responsible can decided based on its capacity—allows exploitation of performance heterogeneity in nodes.
- Giving each physical node multiple virtual nodes also increases the number of partitions (i.e. key ranges), mitigating non-uniform data / load distribution.
Merkle trees for conflict detection
- Merkle trees are hash trees that allows conflict detection in logarithmic time. The idea is that all leaves are hashes of their data, and that all nodes are hashes of a concatenation of its direct children.
- Dynamo nodes keep a Merkle tree for each key range (set of keys owned by a virtual node).
Partition Schemes for Incremental Scaleability
- The default strategy was just to give out T random tokens per node and partition by token value.
- This had the problem of random key ranges, which changed as nodes joined and left the system.
- Key range changes resulted in scanning and rebuilding entire Merkle trees, which was expensive.
- Dynamo eventually migrated to a strategy with a fixed number of equally sized partitions. This saves recalculating any Merkel trees, since key ranges are fixed and leaves / joiners can just copy over a Merkle tree.
Sloppy Quorum
- We should preface this with how requests are fielded in Dynamo.
- Any client request (get or put) is eventually routed to a physical node that serves as a coordinator. The coordinator walks down a preference list of N nodes and requests a read or a write from each. We separately configure read replication factors (R) and write replication factors (W). Read requests requires a quorum of R consistent reads; writes requires a quorum of W consistent writes to succeed. If this is not achieved, the coordinator fails the request.
- Dynamo's quorum is called “sloppy” because instead of requiring the first N nodes, we accept that some nodes will be down and walk the first N healthy nodes.
- Hinted handoff is just the practice of storing writes temporarily on a backup node if the target write node is down. For example, a write to node A which is down, will be stored on node B until A comes back online again. When A comes back online, B transfers the written data to A and deletes it from itself.