4 Months with Cassandra, a love story

Update: Cloudkick has released two Cassandra monitors for column family and thread pool stats. Read the Cassandra checks support wiki for full details.

At Cloudkick we track a ton of metrics about our customers' servers, and it's quite a challenge to store such massive amounts of data. Early on, we made the decision to avoid using tools like RRDTool, so we could provide a more holistic look at infrastructure. We had two firm requirements: we wanted to show trends on a macro level, and we wanted very low latency so our users would never wait for graphs to build. We initially used PostgreSQL, but as we added billions of rows, performance quickly degraded. We used cron jobs that would roll up the data on intervals to trade storage for throughput and lower latency. With clever partitioning we were able to stretch the system to a certain point, but beyond that we faced the need for a much bigger machine and faster rotational disks to accommodate our business requirements; that's when we started looking at other solutions.

We evaluated many of the NoSQL options, but in the end we chose Apache Cassandra because it provided many advantages for our use cases.

Advantages of Cassandra

Linear scalability

To meet our requirements, we needed a fast solution that allowed us to easily throw cloud servers at the problem. Specifically, we needed to achieve thousands of writes per second. Cassandra's architecture, being based on Amazon's Dynamo, could cope with our write load and make it easy to add more nodes to an existing cluster.

Massive write performance

Write performance in Cassandra is excellent. The internals are specifically geared towards a heavy-write system. It writes to a memory table and a serial commit log, and every so often the memory table is flushed to disk in what the Big Table paper describes as a sorted strings table, often called an SSTable — an immutable data structure. There is a lot more happening behind the scenes, but the performance characteristics are clear: there is nothing slow in the write path. The Cassandra wiki page on Architecture Internals provides more details.
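
As a mental model, the write path can be sketched in a few lines of Python. This is a toy, not Cassandra's actual implementation: every write is appended to a commit log, applied to an in-memory table, and the memory table is periodically dumped as an immutable, sorted file.

class ToyStore(object):
    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.commit_log = open('commit.log', 'a')   # sequential, append-only
        self.memtable = {}                          # mutable, in memory
        self.sstables = []                          # immutable, sorted files on disk

    def write(self, key, value):
        self.commit_log.write('%s %s\n' % (key, value))  # durable sequential append
        self.commit_log.flush()
        self.memtable[key] = value                       # fast in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        path = 'sstable-%d.txt' % len(self.sstables)
        with open(path, 'w') as f:
            for key in sorted(self.memtable):            # written out in sorted order
                f.write('%s %s\n' % (key, self.memtable[key]))
        self.sstables.append(path)
        self.memtable = {}                               # start a fresh memory table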

Low operational costs

Traditional sharding and replication with databases like MySQL and PostgreSQL have been shown to work even on the largest scale websites — but they come at a large operational cost. Setting up replication for MySQL can be done quickly, but there are many issues you need to be aware of, such as slave replication lag. Sharding can be done once you reach write throughput limits, but you are almost always stuck writing your own sharding layer to fit how your data is created, and operationally it takes a lot of time to set everything up correctly. We skipped that step altogether and added a couple of hooks to make our data aggregation service siphon data to both PostgreSQL and Cassandra for the initial integration.

Cassandra at Cloudkick

Cassandra has some unique terminology which may be unfamiliar. Please refer to the Data Model wiki page for an explanation of terms.

The data model is much less complex than that of a traditional SQL RDBMS. In the real world, this typically equates to denormalization of data. A great example of that is in Digg's blog post on their use of Cassandra. Our data model is a little bit different in that it doesn't really need to be denormalized, since it's inherently simple.

Cloudkick's primary use of Cassandra is storing the monitoring data for the different metrics checked by our system. This boils down to a very simple model for us: storing all the monitoring information that Cloudkick collects only requires a few Column Families. In our system, we have three big classes of data we store:

  • Metrics (numerics and text)
  • Rolled up data
  • Statuses at time slices

Of these different classes of data, all are "indexed" by timestamp.

Here's a Column Family declaration as it would appear in storage-conf.xml:
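
Assuming the NumericArchive name and MonitorApp keyspace that show up in the examples later in this post, the relevant snippet of storage-conf.xml looks something like this:

<Keyspaces>
  <Keyspace Name="MonitorApp">
    <!-- Column names are 8-byte longs (timestamps), kept in numeric order -->
    <ColumnFamily Name="NumericArchive" CompareWith="LongType" />
  </Keyspace>
</Keyspaces>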

This particular Column Family stores every single original point of data in the system. The LongType declaration specifies that the column names are 8-byte values that determine the order of the columns. It's analogous to choosing a primary key in a database, except in Cassandra you only get a couple of "indexes" for free. Some people say Cassandra is a 4 or 5 level hash, depending on how you configure the column family. The pseudo-hash is below:

[KeySpace][CF][Key][Column] or [KeySpace][CF][Key][SuperColumn][SubColumn]

You can also do range queries on one or two levels depending on the Partitioner you pick. With the Order Preserving Partitioner, you can perform a range_slice which can give you a slice of the keys and a slice of the columns. This is pretty powerful, since it gives you a full cross-section of the data with a single call. However, we use the Random Partitioner, which only allows range queries at the column level. In Python-land, we store each metric with the typical Thrift call. Storing the column names is as simple as using the struct module (!Q is an 8-byte value in network order).

import struct

name = struct.pack('!Q', long_ts)   # pack the timestamp as an 8-byte, network-order column name

This snippet of code simply converts the timestamp into a value that's ready to be used as a column name when inserting values into the system. It's important to keep the long column names in network order, as that's what the Cassandra Java process expects. For the curious, Thrift takes care of byte order for values that are explicitly typed; however, since a column name is just a byte stream, Thrift has no knowledge of its type and therefore doesn't reorder it.
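
To make that concrete: big-endian is the byte order a Java long uses, and it also means that comparing the packed names byte by byte gives the same ordering as comparing the timestamps themselves.

import struct

a = struct.pack('!Q', 1257894000)   # 8-byte unsigned long, big-endian (network order)
b = struct.pack('!Q', 1257897600)
assert a < b                                      # byte-wise order matches numeric order
assert struct.unpack('!Q', a)[0] == 1257894000    # and unpacking is lossless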

Reading the values in Python is much the same. If I wanted values for a time slice between '2009-09-01' and '2009-09-03', I could use the following:

import time
from datetime import datetime

t1 = int(time.mktime(datetime(2009, 9, 1).timetuple()))
t2 = int(time.mktime(datetime(2009, 9, 3).timetuple()))
# ColumnParent, SlicePredicate, SliceRange and ONE come from the generated Thrift bindings
result = client.get_slice('MonitorApp', '',
                          ColumnParent(column_family='NumericArchive'),
                          SlicePredicate(slice_range=SliceRange(start=t1, end=t2)),
                          ONE)
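
get_slice returns a list of ColumnOrSuperColumn wrappers; turning the packed column names back into timestamps is just the reverse of the pack shown earlier:

import struct

for cosc in result:                                  # one wrapper per column returned
    ts = struct.unpack('!Q', cosc.column.name)[0]    # 8-byte name back to an integer timestamp
    value = cosc.column.value                        # the stored metric value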

Another interesting bit of Cloudkick technology is how we roll values up into different time slices. To keep querying fast, a mechanism like this is necessary.

We keep track of the following roll-up intervals:

  • 5 minutes
  • 20 minutes
  • 30 minutes
  • 60 minutes
  • 4 hours
  • 12 hours
  • 1 day

For instance, if you want to look over the "month" time period, there's a good chance that you don't want to look at 5 minute intervals. If you looked at the 5 minute intervals over a month, for one series you could potentially see 8640 points of data (12 five-minute intervals * 24 hours * 30 days). This is obviously too much data, so you would look at something more manageable, like the one-hour or four-hour intervals, which would be 720 or 180 points, respectively. For each point, we also store additional information, so the column definition looks like this:
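
The roll-up Column Families are declared as super columns; a representative declaration (Rollup5m is a placeholder name) might look like:

<ColumnFamily Name="Rollup5m" ColumnType="Super"
              CompareWith="LongType"
              CompareSubcolumnsWith="BytesType" />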

This declaration is different from the NumericArchive declaration because it is a SuperColumn. If you remember the hash example above, we have now added another layer which we can leverage. In the NumericArchive Column Family we only stored the native point; since we're combining multiple points into one data point here, there's more data for us to store, so in the sub-columns we store some simple key-value pairs. Notice that the sub-columns are sorted by BytesType, which does a string comparison; this part is less interesting for us since we typically retrieve either all the sub-columns or none at all.

The different key-value pairs we store, with regard to the Rollup columns, are:

  • average - the average of all the points over the interval
  • count - the number of points accumulated in the interval
  • derivative - the change in value over the change in time (great for counters or gauges like bandwidth)

We still get the sorted values with another level of the hash, which is useful as we can then retrieve the derivative and average all in one query.
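
As an illustration (this is a simplification, not our production roll-up code), computing those three values for one interval looks roughly like:

def rollup(points):
    """points: list of (timestamp, value) pairs that fall inside one interval."""
    points = sorted(points)
    values = [value for _, value in points]
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    elapsed = t_last - t_first
    return {
        'average': sum(values) / float(len(values)),     # average of all points
        'count': len(values),                            # points accumulated
        'derivative': (v_last - v_first) / float(elapsed) if elapsed else 0.0,
    }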

Hybrid NoSQL

Cloudkick still uses traditional SQL systems for much of our data — data like our user accounts is stored in a standard master/slave MySQL setup. We even keep data that references keys in Cassandra in MySQL, so we can quickly write a Django view that queries metadata about a monitoring check using the Django ORM, but still use Cassandra for the bulk of the data about that check. We'll likely keep moving more data into Cassandra as we need to, but for some data the ability to write arbitrary SQL queries is still very useful.
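
For illustration only (the model and field names here are invented, not our schema), the pattern looks like this: the check's metadata comes out of MySQL through the ORM, and its primary key doubles as the row key for the bulk data in Cassandra.

from django.db import models

class MonitorCheck(models.Model):
    """Check metadata lives in MySQL; the primary key doubles as the Cassandra row key."""
    name = models.CharField(max_length=255)
    node = models.CharField(max_length=255)

# Inside a Django view:
check = MonitorCheck.objects.get(pk=check_id)        # metadata via the ORM
points = client.get_slice('MonitorApp', str(check.pk),
                          ColumnParent(column_family='NumericArchive'),
                          SlicePredicate(slice_range=SliceRange(start=t1, end=t2)),
                          ONE)                        # bulk monitoring data via Cassandra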

Administration and operational issues

nodetool, previously known as nodeprobe

Cassandra includes a utility called nodetool, previously called nodeprobe. This utility lets you do common administrative tasks on your cluster, like checking if a node is up, decommissioning a node, or triggering a compaction. The Cassandra wiki page on nodetool provides more details.
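
For example, against a node in the cluster it looks roughly like this (the host is a placeholder and flag spelling varies between releases):

nodetool -host 10.176.0.1 ring          # list the nodes in the ring and whether they're up
nodetool -host 10.176.0.1 info          # load, uptime and heap usage for a single node
nodetool -host 10.176.0.1 compact       # trigger a major compaction (see below)
nodetool -host 10.176.0.1 decommission  # remove the node from the ring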

Major compactions

Because SSTables are never modified on-disk, only replaced, you need to do a major compaction in order to reclaim all of your disk space if you delete or modify a row.

You do need to keep an eye on your disk-space growth and make sure to trigger a major compaction periodically, ideally when your system is under lower load.

Tombstones

Because Cassandra is a distributed database, deleting rows can be complicated. Jonathan Ellis' recent blog post does a fantastic job explaining how distributed deletes and tombstones can affect maintenance and administration.

Random / Order Preserving Partitioner

Choosing tokens for each node is a major part of a Cassandra deployment. The token selection decides which node stores which data in Cassandra. If you're using the Order Preserving Partitioner, you must be very careful about how you pick tokens, because with bad ones your data will be lopsided, with significantly more being stored on a small number of nodes. If you know how your keys are generated, you need to partition them as evenly as possible. If you don't need to do range-based queries, we'd suggest using the Random Partitioner. The Cassandra wiki contains more information on configuring the partitioner.
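
With the Random Partitioner, tokens are md5-based and live in the range 0 to 2**127, so evenly spaced initial tokens are easy to compute up front. A quick sketch:

def initial_tokens(node_count):
    """Evenly spaced initial tokens across the RandomPartitioner's 0..2**127 space."""
    return [i * (2 ** 127 // node_count) for i in range(node_count)]

# initial_tokens(4) ->
# [0,
#  42535295865117307932921825928971026432,
#  85070591730234615865843651857942052864,
#  127605887595351923798765477786913079296]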

Client reconnection

Our original client would try a single Cassandra node and throw an exception if it was unable to connect. This might make sense if you're used to a data storage system like MySQL where, if the master is down, you can't do much. In Cassandra's case, you want to make sure you try to query multiple nodes before giving up — so make your clients try a list of servers before quitting.
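
A minimal sketch of that idea, assuming the stock Thrift-generated Python bindings are importable as the cassandra package:

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra   # Thrift-generated Cassandra bindings

def connect(hosts, port=9160):
    """Try each node in turn instead of failing on the first connection error."""
    for host in hosts:
        try:
            transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
            transport.open()
            return Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
        except TTransport.TTransportException:
            continue                        # node down or unreachable, try the next one
    raise Exception('no Cassandra nodes reachable: %s' % ', '.join(hosts))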

Thrift issues

Since Cassandra uses Apache Thrift as the default RPC mechanism, exposing the Thrift layer to any non-controlled data can be dangerous. We use firewalls on our nodes to make sure our Thrift ports are only exposed to a very small set of machines, because even just telneting into the port and typing "hello" can cause the JVM to OOM. This is discussed in the THRIFT-601 issue report. In Cassandra trunk, Apache Avro is available as an alternative communication method, and shouldn't suffer from these types of issues.
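
As a concrete sketch (the subnet here is hypothetical), an iptables policy that only lets the application tier reach the default Thrift port 9160 looks like:

iptables -A INPUT -p tcp --dport 9160 -s 10.176.1.0/24 -j ACCEPT   # application servers only
iptables -A INPUT -p tcp --dport 9160 -j DROP                      # everyone else is blocked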

What's missing?

There are many features we would like to see in Cassandra itself, but most of those (like compressed SSTables) are already being tackled by the developers.

We ran into many problems with the Ordered Partitioner, and it would be nice for certain data models if this partitioner would work in a more automated fashion.

We also believe Cassandra would benefit from integration with existing open source projects. For example, a Django ORM layer would instantly expose Cassandra to many more developers. Right now, everyone using Cassandra is building custom systems to communicate with it, but with a little work, many more projects could be "Cassandra Enabled."

Cassandra community

The Cassandra community recently graduated to a top-level project at the Apache Software Foundation. Through the hard work of all its committers and contributors, the project has come a long way in a very short amount of time. The open community model is yielding an increasingly impressive data storage system, which is being used by many companies around the world.

Cassandra is quickly approaching its next iteration, version 0.6. New features include row-level caching, support for Hadoop/MapReduce, authenticated connections, new statistics exposed over JMX, per-keyspace replication factors, and a new batch_mutate method. The pace of innovation in the project is impressive and we're eager to upgrade.

We would like to sincerely thank all the users, contributors and developers who have made Apache Cassandra such a successful project!