Understanding HBase and BigTable | Javalobby


Understanding HBase and BigTable

Submitted by Jim Wilson on Thu, 2008/05/22 - 2:45am

Tags:

bigtable
hbase


The hardest part about learning HBase (the open source implementation of Google's BigTable) is just wrapping your mind around the concept of what it actually is.

I find it rather unfortunate that these two great systems contain the words table and base in their names, which tends to cause confusion among RDBMS-indoctrinated individuals (like myself).

This article aims to describe these distributed data storage systems from a conceptual standpoint. After reading it, you should be better able to make an educated decision regarding when you might want to use HBase vs. when you'd be better off with a "traditional" database.

It's all in the terminology

Fortunately, Google's BigTable Paper clearly explains what BigTable actually is. Here is the first sentence of the "Data Model" section:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

Note: At this juncture I like to give readers the opportunity to collect any brain matter which may have left their skulls upon reading that last line.

The BigTable paper continues, explaining that:

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Along those lines, the HbaseArchitecture page of the Hadoop wiki posits that:

HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.

Although all of that may seem rather cryptic, it makes sense once you break it down a word at a time. I like to discuss them in this sequence: map, persistent, distributed, sorted, multidimensional, and sparse.

Rather than trying to picture a complete system all at once, I find it easier to build up a mental framework piecemeal, to ease into it...

map

At its core, HBase/BigTable is a map. Depending on your programming language background, you may be more familiar with the terms associative array (PHP), dictionary (Python), Hash (Ruby), or Object (JavaScript).

From the Wikipedia article, a map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value."

Using JavaScript Object Notation, here's an example of a simple map where all the values are just strings:

{
  "zzzzz" : "woot",
  "xyz" : "hello",
  "aaaab" : "world",
  "1" : "x",
  "aaaaa" : "y"
}
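For readers who would rather run something, the same map can be sketched as a Python dictionary. Python is just a convenient choice for illustration here, and nothing in this snippet is specific to HBase or BigTable. It shows the one property that matters at this stage: each key is associated with exactly one value.

```python
# A plain in-memory map: each key is associated with exactly one value.
simple_map = {
    "zzzzz": "woot",
    "xyz": "hello",
    "aaaab": "world",
    "1": "x",
    "aaaaa": "y",
}

# Look up a value by its key.
greeting = simple_map["xyz"]

# Assigning to an existing key replaces its value; the map still has 5 entries.
simple_map["1"] = "z"
```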

persistent

Persistence merely means that the data you put in this special map "persists" after the program that created or accessed it is finished. This is no different in concept than any other kind of persistent storage such as a file on a filesystem. Moving along...
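As a toy picture of what "persists" means, the sketch below writes a map to a JSON file and reads it back. The file location and the helper names `save_map` and `load_map` are made up for this illustration; this is ordinary file storage, not how HBase or BigTable store data.

```python
import json
import os
import tempfile

# Hypothetical file location, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "simple_map.json")


def save_map(data, path):
    """Write the map to disk so it outlives the code that created it."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)


def load_map(path):
    """Read a previously saved map back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


save_map({"aaaaa": "y", "aaaab": "world"}, path)

# A later run of the program (simulated here by a separate call) sees the same data.
restored = load_map(path)
```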

distributed

HBase and BigTable are built upon distributed filesystems so that the underlying file storage can be spread out among an array of independent machines.

HBase sits atop either Hadoop's Distributed File System (HDFS) or Amazon's Simple Storage Service (S3), while BigTable makes use of the Google File System (GFS).

Data is replicated across a number of participating nodes, loosely analogous to how a RAID array spreads data across multiple disks for redundancy.

For the purpose of this article, we don't really care which distributed filesystem implementation is being used. The important thing to understand is that it is distributed, which provides a layer of protection against, say, a node within the cluster failing.
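To make the "layer of protection" idea concrete, here is a deliberately simplified sketch of replication. It assumes a replication factor of 3 and four hypothetical nodes, and it picks nodes with a simple checksum. Real systems such as HDFS and GFS use their own, far more sophisticated placement policies, so treat this as a thought experiment rather than a description of either one.

```python
import zlib

# Assumption for illustration: keep three copies of every block.
REPLICATION = 3

# Hypothetical cluster: node name -> blocks stored on that node.
nodes = {name: {} for name in ("node-a", "node-b", "node-c", "node-d")}


def put_block(block_id, data):
    """Store copies of a block on REPLICATION distinct nodes."""
    names = sorted(nodes)
    start = zlib.crc32(block_id.encode("utf-8")) % len(names)
    for i in range(REPLICATION):
        nodes[names[(start + i) % len(names)]][block_id] = data


def get_block(block_id, failed=()):
    """Read a block from any node that is still up and holds a copy."""
    for name, blocks in nodes.items():
        if name not in failed and block_id in blocks:
            return blocks[block_id]
    raise KeyError(block_id)


put_block("block-1", b"some bytes")
copies = sum("block-1" in blocks for blocks in nodes.values())

# Even with one node down, at least two copies remain reachable.
survivor = get_block("block-1", failed=("node-a",))
```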

sorted

Unlike most map implementations, in HBase/BigTable the key/value pairs are kept in strict alphabetical order (more precisely, lexicographic order of the key bytes). That is to say that the row for the key "aaaaa" should be right next to the row with key "aaaab" and very far from the row with key "zzzzz".

Continuing our JSON example, the sorted version looks like this:

{
  "1" : "x",
  "aaaaa" : "y",
  "aaaab" : "world",
  "xyz" : "hello",
  "zzzzz" : "woot"
}

Because these systems tend to be so huge and distributed, this sorting feature is actually very important. The spatial propinquity of rows with like keys ensures that when you must scan the table, the items of greatest interest to you are near each other.
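Why sorted keys help scanning can be sketched with a sorted list and binary search. This is only an in-memory analogy, assuming Python's `bisect` module as a stand-in for whatever HBase/BigTable do internally. The `scan` helper is a hypothetical name, not an HBase API.

```python
import bisect

rows = {"zzzzz": "woot", "xyz": "hello", "aaaab": "world", "1": "x", "aaaaa": "y"}

# Keep the keys in sorted order. Python compares strings by code point,
# which matches byte order for these ASCII keys.
sorted_keys = sorted(rows)


def scan(start, stop):
    """Return (key, value) pairs with start <= key < stop, in key order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    # All matching rows sit in one contiguous slice of the sorted keys.
    return [(key, rows[key]) for key in sorted_keys[lo:hi]]


a_rows = scan("a", "b")
```

Here `a_rows` holds the two rows whose keys start with "a". They were found by locating two positions in the sorted key list rather than by inspecting every entry.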

This is important when choosing a row key convention. For example, consider a table whose keys are domain names. It makes the most sense to list them in reverse notation (so "com.jimbojw.www" rather than "www.jimbojw.com") so that rows about a subdomain will be near the parent domain row.

Continuing the domain example, the row for the domain "mail.jimbojw.com" would be right next to the row for "www.jimbojw.com" rather than, say, "mail.xyz.com", which is what would happen if the keys were in regular domain notation.
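The effect of the reversed notation is easy to check with the three domains mentioned above. The helper name `reverse_domain` is made up for this sketch.

```python
def reverse_domain(domain):
    """Turn 'www.jimbojw.com' into 'com.jimbojw.www'."""
    return ".".join(reversed(domain.split(".")))


domains = ["www.jimbojw.com", "mail.xyz.com", "mail.jimbojw.com"]

# Regular notation: the two jimbojw.com hosts are separated by mail.xyz.com.
plain_order = sorted(domains)

# Reverse notation: hosts of the same parent domain end up adjacent.
reversed_order = sorted(reverse_domain(d) for d in domains)
```

In `plain_order`, "mail.xyz.com" lands between the two jimbojw.com rows. In `reversed_order`, "com.jimbojw.mail" and "com.jimbojw.www" sit side by side.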

It's important to note that the term "sorted" when applied to HBase/BigTable does not mean that "values" are sorted. There is no automatic indexing of anything other than the keys, just as it would be in a plain-old map implementation.
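A small sketch makes the distinction visible, using the example map from earlier. Walking the map in key order yields values in no particular order, and finding an entry by its value means inspecting every entry. The function name `find_keys_by_value` is hypothetical and exists only for this illustration.

```python
rows = {"zzzzz": "woot", "xyz": "hello", "aaaab": "world", "1": "x", "aaaaa": "y"}

# Walking the map in key order does not yield the values in sorted order.
values_in_key_order = [rows[key] for key in sorted(rows)]


def find_keys_by_value(rows, wanted):
    """There is no index on values, so every entry has to be inspected."""
    return [key for key in sorted(rows) if rows[key] == wanted]


matches = find_keys_by_value(rows, "hello")
```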
