Anti-RDBMS: A list of distributed key-value stores

Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:

  1. You’re suffering from Cloud-computing Mania.
  2. You need an excuse to ‘get your Erlang on’.
  3. You heard CouchDB was cool.
  4. You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
  5. Your data is stored and retrieved mainly by primary key, without complex joins.
  6. You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.

Whatever your reasons, there are a lot of options to choose from. At Last.fm we do a lot of batch computation in Hadoop, then dump it out to other machines where it’s indexed and served up over HTTP and Thrift as an internal service (stuff like ‘most popular songs in London, UK this week’ etc). Presently we’re using a home-grown index format which points into large files containing lots of data spanning many keys, similar to the Haystack approach mentioned in this article about Facebook photo storage. It works, but rather than build our own replication and partitioning system on top of this, we are looking to potentially replace it with a distributed, resilient key-value store for reasons 4, 5 and 6 above.

This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions. I’m expecting to try some of these out and investigate further in the coming months.

Glossary and Background Reading

  • Distributed Hash Table (DHT) and algorithms such as Chord or Kademlia
  • Amazon’s Dynamo Paper, and this ReadWriteWeb article about Dynamo which explains why such a system is invaluable
  • Amazon’s SimpleDB Service, and some commentary
  • Google’s BigTable paper
  • The Paxos Algorithm - read this page in order to appreciate that knocking up a Paxos implementation isn’t something you’d want to do whilst hungover on a Saturday morning.

The Shortlist

Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting nonetheless.

| Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | B | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ascii, Thrift | blob | D+ | Powerset, no |
| MemcacheDB | C | replication | BerkeleyDB | Memcached | blob | B | some |
| ThruDB | C++ | Replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | Replication, partitioning? | Custom on-disk | HTTP, json | Document oriented (json) | A | Apache, yes |
| Cassandra | Java | Replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no |
| HBase | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | Bigtable | A | Apache, yes |
| Hypertable | C++ | Replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes |


Why 5 of these aren’t suitable

What I’m really looking for is a low latency, replicated, distributed key-value store. Something that scales well as you feed it more machines, and doesn’t require much setup or maintenance - it should just work. The API should be that of a simple hashtable: set(key, val), get(key), delete(key). This would dispense with the hassle of managing a sharded / replicated database setup, and hopefully be capable of serving up data by primary key efficiently.
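For concreteness, this is roughly the interface I have in mind - a sketch only, not the API of any project listed above:

```python
from typing import Optional


class KeyValueStore:
    """The hashtable-shaped contract described above (illustrative only)."""

    def set(self, key: str, val: bytes) -> None:
        raise NotImplementedError

    def get(self, key: str) -> Optional[bytes]:
        raise NotImplementedError  # returns None if the key is absent

    def delete(self, key: str) -> None:
        raise NotImplementedError
```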

Five of the projects on the list are far from being simple key-value stores, and as such don’t meet the requirements - but they are definitely worth a mention.

1) We’re already heavy users of Hadoop, and have been experimenting with HBase for a while. It’s much more than a KV store, but latency is too great to serve data to the website. We will probably use HBase internally for other stuff though - we already have stacks of data in HDFS.

2) Hypertable provides a similar feature set to HBase (both are inspired by Google’s Bigtable). They recently announced a new sponsor, Baidu - the biggest Chinese search engine. Definitely one to watch, but it doesn’t fit the low-latency KV store bill either.
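For anyone unfamiliar with the Bigtable model that HBase and Hypertable implement, it is essentially a sparse, sorted map keyed by row, column (family:qualifier) and timestamp. Here’s a toy in-memory approximation of that data model - purely conceptual, not either project’s actual API, and the row and column names are made up:

```python
import time
from collections import defaultdict

# Toy sketch of the Bigtable data model: a sparse map of
# (row key, "family:qualifier" column, timestamp) -> value.
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, value, ts=None):
    table[row][column][ts if ts is not None else time.time()] = value

def get_latest(row, column):
    cells = table[row][column]
    return cells[max(cells)] if cells else None

put("com.example.www", "contents:html", "<html>...</html>")
put("com.example.www", "anchor:referrer.example", "Example link text")
print(get_latest("com.example.www", "contents:html"))
```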

3) Cassandra sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present.

4) CouchDB is an interesting one - it’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. Read the CouchDB Technical Overview if you are curious how the web’s trendiest document database works under the hood. This article on the Rules of Database App Aging goes some way to explaining why document-oriented databases make sense. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldy adding and changing columns in a database, and updating versions of your application code to match. Although many people are using CouchDB in production, their FAQ points out they may still make backwards-incompatible changes to the storage format and API before version 1.0.
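To make the user-profile example concrete, here’s a sketch of what those documents might look like over CouchDB’s HTTP/JSON API. The field names, document IDs and the local server URL are assumptions, and it presumes a ‘users’ database has already been created:

```python
import json
import urllib.request

# Two user documents with different field sets - CouchDB is schema-free, so
# sparse or evolving records don't require any column wrangling.
users = {
    "alice": {"name": "Alice", "age": 30, "city": "London", "im": "alice_im"},
    "bob":   {"name": "Bob", "sex": "m"},  # fields that would be NULL are simply absent
}

for doc_id, doc in users.items():
    # CouchDB's RESTful API: PUT /<db>/<doc_id> stores a JSON document.
    req = urllib.request.Request(
        url=f"http://localhost:5984/users/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    print(urllib.request.urlopen(req).read())
```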

5) ThruDB is a document storage and indexing system made up of four components: a document storage service, an indexing service, a message queue and a proxy. It uses Thrift for communication, and has a pluggable storage subsystem, including an Amazon S3 option. It’s designed to scale well horizontally, and might be a better option than CouchDB if you are running on EC2. I’ve heard a lot more about CouchDB than ThruDB recently, but it’s definitely worth a look if you need a document database. It’s not suitable for our needs for the same reasons as CouchDB.

Distributed key-value stores

The rest are much closer to being ’simple’ key-value stores with low enough latency to be used for serving data used to build dynamic pages. Latency will be dependent on the environment, and whether or not the dataset fits in memory. If it does, I’d expect sub-10ms response time, and if not, it all depends on how much money you spent on spinning rust.

MemcacheDB is essentially just memcached that saves stuff to disk using a Berkeley database. As useful as this may be for some situations, it doesn’t deal with replication and partitioning (sharding), so it would still require a lot of work to make it scale horizontally and be tolerant of machine failure. Other memcached derivatives such as repcached go some way to addressing this by giving you the ability to replicate entire memcache servers (async master-slave setup), but without partitioning it’s still going to be a pain to manage.
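Since MemcacheDB speaks the standard memcached protocol, existing memcached client libraries should work against it unchanged. A quick sketch using the python-memcached client (the host, port and key names here are assumptions):

```python
import memcache  # python-memcached; any memcached client library should do

# Point a stock memcached client at a MemcacheDB instance (address assumed).
mc = memcache.Client(["127.0.0.1:21201"])

mc.set("user:1000:name", "rj")    # persisted to BerkeleyDB on the server side
print(mc.get("user:1000:name"))   # -> 'rj'
mc.delete("user:1000:name")
```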

Project Voldemort looks awesome. Go and read the rather splendid website, which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama - a consistent hashing library and the Erlang libketama driver.) Project-Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing-list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.
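If you haven’t seen consistent hashing before, the idea is small enough to sketch: hash each server onto a ring many times (‘virtual nodes’), then hash each key and walk clockwise to the next server point. This is just an illustration of the technique in the spirit of libketama - not Voldemort’s or libketama’s actual code, and the server addresses are made up:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hashing ring (illustrative, not libketama itself)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas   # virtual nodes per server
        self.ring = {}             # ring position -> server
        self.sorted_points = []
        for node in nodes:
            self.add_node(node)

    def _hash(self, data):
        return int(hashlib.md5(data.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self.ring[point] = node
            bisect.insort(self.sorted_points, point)

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.sorted_points, self._hash(key)) % len(self.sorted_points)
        return self.ring[self.sorted_points[idx]]

ring = HashRing(["10.0.0.1:6666", "10.0.0.2:6666", "10.0.0.3:6666"])
print(ring.get_node("user:1000"))  # only ~1/N of keys move when a node joins or leaves
```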

Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store - it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing - map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.
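To see why lexicographically ordered keys matter, here’s the range-query idea in miniature - a single sorted list standing in for what Scalaris does across a whole ring (the key names are invented):

```python
import bisect

# With keys kept in lexicographic order, a range query is just two binary
# searches plus a contiguous scan - no need to touch every key in the store.
keys = sorted(["track:0007", "user:0001", "user:0002", "user:0042", "user:9999"])

def range_query(lo, hi):
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return keys[start:end]

print(range_query("user:", "user:~"))  # every key with the 'user:' prefix
```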

Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia uses. Yikes.

From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you’ve rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video.

The remaining projects, Dynomite, Ringo and Kai, are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist - it makes a distinction between small (less than 4KB) and medium-size data items (<100MB). Medium-sized items are stored in individual files, whereas small items are all stored in an append-only log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on, called Disco.
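The small-item scheme is a classic pattern: append records to a log and keep only the offsets in memory. Here’s a rough sketch of the general technique - the record framing is made up, Ringo’s real on-disk format will differ, and a real implementation would also rebuild the index by scanning the log at startup:

```python
import os

class AppendLog:
    """Append-only log with an in-memory offset index (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> (offset, length)
        open(path, "ab").close()     # make sure the file exists

    def put(self, key, value: bytes):
        offset = os.path.getsize(self.path)
        header = f"{key}\t{len(value)}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(header + value)
        self.index[key] = (offset + len(header), len(value))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

log = AppendLog("/tmp/ringo-sketch.log")
log.put("song:1", b"Karma Police")
print(log.get("song:1"))  # -> b'Karma Police'
```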

I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can choose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.

Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.

Summary

Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.

I’m keeping an eye on Dynomite - hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.

Based on my research so far, Project-Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.

What else is there?

Here are some other related projects:

  • Hazelcast - Java DHT/clustering library
  • nmdb - a network database (dbm-style)
  • Open Chord - Java DHT

If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.

UPDATE 1: Corrected table: memcachedb does replication, as per BerkeleyDB.

Tags: databases, dht, erlang, hashing, java