A comparison of some distributed key/value storage systems

Source: Baidu Wenku. Editor: 神马文学网. Time: 2024/04/28 20:12:39

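You may have all sorts of reasons to use a distributed key/value store: perhaps you are looking at cloud computing, considering CouchDB, or need to store a large number of small files. A distributed key/value store can implement all of these, or help implement them.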

This article introduces some popular distributed key/value stores and makes some simple comparisons between them.

Vocabulary and background reading:

Distributed Hash Table (DHT) and algorithms such as Chord or Kademlia

Amazon’s Dynamo Paper, and this ReadWriteWeb article about Dynamo which explains why such a system is invaluable

Amazon’s SimpleDB Service, and some commentary

Google’s BigTable paper

The Paxos Algorithm - read this page in order to appreciate that knocking up a Paxos implementation isn’t something you’d want to do whilst hungover on a Saturday morning.

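The table below lists some systems that might replace a relational database. Some of them are more than key/value stores, and some are not suited to low-latency data serving, but they are still very interesting.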

| Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community |
|---|---|---|---|---|---|---|---|
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | B | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ascii, Thrift | blob | D+ | Powerset, no |
| MemcacheDB | C | no | BerkeleyDB | Memcached | blob | B | some |
| ThruDB | C++ | Replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | Replication, partitioning? | Custom on-disk | HTTP, json | Document oriented (json) | A | Apache, yes |
| Cassandra | Java | Replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no |
| HBase | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | Bigtable | A | Apache, yes |
| Hypertable | C++ | Replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes |

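What I am looking for is a low-latency, automatically replicated, distributed key/value store that is simple to scale and easy to maintain, with a very simple API: nothing more than hash-style set, get and delete. Five of the systems in the table do not meet those requirements, but they still deserve a mention: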

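1. HBase plays an important role in Hadoop, but its latency is too high for our needs.

2. Hypertable was modelled on Google's Bigtable, and Baidu recently became a sponsor, but it also fails the latency requirement.

3. Cassandra: Facebook never seems to have really run it as an open source project, and documentation is severely lacking.

4. CouchDB is a good project overall, and you interact with the data through a RESTful HTTP/JSON API. It is not yet mature enough, though, and it still has some way to go before it can store data field by field the way database columns do.

5. ThruDB is a decent document storage engine made up of four parts, but it was ruled out for the same reasons as CouchDB.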

Setting those five aside, everything left in the table is a candidate for a distributed KV store. Most of them scale well and offer low latency.

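1. MemcacheDB stores its data in a BDB (Berkeley DB) backend, but it does not yet address sharding or replication (this is what the original says; the latest version appears to have partly addressed it). Other memcached servers such as repcached handle replication but do not do sharding well.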

The translation stops here; the remaining projects are described in the original text:

Project Voldemort looks awesome. Go and read the rather splendid website, which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama - a consistent hashing library and the Erlang libketama driver). Project-Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing-list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.
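To make the consistent hashing idea mentioned above concrete, here is a minimal Python sketch of a hash ring with virtual nodes. It is not Voldemort's or libketama's implementation: the class name `HashRing`, the node names, the use of MD5 and the figure of 100 virtual nodes per server are all assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node (assumed value)
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node is placed on the ring at several pseudo-random points.
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The point of the exercise is the last few lines: when a node joins, only the keys that now fall on the new node's points change owner, whereas a plain `hash(key) % number_of_nodes` scheme would reassign most keys.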

Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store - it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing - map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.
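As a rough illustration of why keeping keys in lexicographical order makes range queries possible, the following Python sketch keeps a sorted key list in a single process. It says nothing about how Scalaris distributes keys over its Chord-style ring or about its real API; `OrderedStore` and the example key names are hypothetical.

```python
import bisect


class OrderedStore:
    """Single-process stand-in for a store that keeps its keys sorted."""

    def __init__(self):
        self._keys = []     # keys kept in lexicographical order
        self._values = {}   # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = OrderedStore()
for name in ["artist:blur", "artist:abba", "track:42", "artist:cake"]:
    store.set(name, len(name))

# ';' is the character right after ':' in ASCII, so this covers every
# key that starts with "artist:".
artists = store.range("artist:", "artist;")
print(artists)
```

With a plain hash-partitioned store, adjacent keys land on unrelated nodes, so the same scan would have to visit every node; with ordered keys it touches one contiguous slice.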

Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia use. Yikes.

From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video.

The remaining projects, Dynomite, Ringo and Kai, are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist - it makes a distinction between small (less than 4KB) and medium-size data items (<100MB). Medium sized items are stored in individual files, whereas small items are all stored in an append-log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on called Disco.
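The append-log arrangement described for small items can be sketched in a few lines of Python. This is only a guess at the general shape: the record layout (two 4-byte lengths followed by key and value), the class name `AppendLog` and the file name are assumptions, not Ringo's actual on-disk format. Medium-sized items, which the text says go to individual files, are left out.

```python
import os
import struct
import tempfile

# Hypothetical record layout: key length, value length, then key and value.
HEADER = struct.Struct(">II")


class AppendLog:
    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> (value offset, value length)
        open(path, "ab").close()   # create the log if it does not exist
        self._load_index()

    def _load_index(self):
        # At startup, scan the whole log once and rebuild the in-memory index.
        with open(self.path, "rb") as f:
            while True:
                header = f.read(HEADER.size)
                if len(header) < HEADER.size:
                    break
                klen, vlen = HEADER.unpack(header)
                key = f.read(klen).decode("utf-8")
                self.index[key] = (f.tell(), vlen)
                f.seek(vlen, os.SEEK_CUR)   # skip the value itself

    def put(self, key, value):
        kbytes = key.encode("utf-8")
        start = os.path.getsize(self.path)  # new records go at the end
        with open(self.path, "ab") as f:
            f.write(HEADER.pack(len(kbytes), len(value)))
            f.write(kbytes)
            f.write(value)
        self.index[key] = (start + HEADER.size + len(kbytes), len(value))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


path = os.path.join(tempfile.mkdtemp(), "small-items.log")
log = AppendLog(path)
log.put("a", b"hello")
log.put("b", b"world!")

reopened = AppendLog(path)   # index is rebuilt by scanning the log
print(reopened.get("a"), reopened.get("b"))
```

Writes never seek backwards, and a read costs one dictionary lookup plus one disk read; the trade-off is that startup time and memory use grow with the number of records in the log.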

I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can choose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.
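Speaking the memcached protocol matters because its text form is so small that existing clients work unchanged. The Python sketch below builds and parses the standard memcached text commands for `set` and `get` without connecting to any server. The helper names are hypothetical, and which subset of memcached commands Kai actually supports is not stated here.

```python
def encode_set(key, value, flags=0, exptime=0):
    """Build a memcached text-protocol 'set' request for a bytes value."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def encode_get(key):
    """Build a memcached text-protocol 'get' request for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(data):
    """Parse a single-key 'get' reply; return the value, or None on a miss."""
    if data == b"END\r\n":
        return None
    # A hit looks like: VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    header, rest = data.split(b"\r\n", 1)
    _, _key, _flags, length = header.split(b" ")
    return rest[:int(length)]


request = encode_set("greeting", b"hello")
print(request)
print(encode_get("greeting"))
print(parse_get_response(b"VALUE greeting 0 5\r\nhello\r\nEND\r\n"))
```

In practice you would send these bytes over a TCP socket, or simply point an existing memcached client library at the server.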

Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.

Summary

Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.

I’m keeping an eye on Dynomite - hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.

Based on my research so far, Project-Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.

What else is there?

Here are some other related projects:

Hazelcast - Java DHT/clustering library
nmdb - a network database (dbm-style)
Open Chord - Java DHT

If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.
