Top 10 Reasons to Avoid the SimpleDB Hype — ryanpark.org

来源:百度文库 编辑:神马文学网 时间:2024/05/01 08:54:23

Top 10 Reasons to Avoid the SimpleDB Hype

April 21st, 2008 | Professional, Technology

There is a ton of chatter on the Internet about Amazon SimpleDB, Apache CouchDB, Google App Engine’s Datastore API,and other distributed key-value data stores. Their biggest perceivedadvantage is scalability: they can help eliminate the bottleneckimposed by single-server databases.

But the hype around these new databases is growing frantic. This morning I read an article by Todd Hoffwhich fawned over SimpleDB’s unconventional rules to such an extentthat I thought it might be satire. There are some significant drawbacksto developing in this new database paradigm. In fact, many of Mr.Hoff’s supposed advantages are actually serious disadvantages to theparadigm. Before designing your architecture around a database enginelike SimpleDB, it’s important to consider the reasons not to do so.

 

Most of my points are directed at the Amazon SimpleDB service, butmany also apply to other databases like CouchDB and the GoogleDatastore.

1. Data integrity is not guaranteed.

Data stores like SimpleDB don’t support the same rigorousconstraints that RDBMSes do. Some of these databases support single-rowconstraints, like requiring data in certain fields, but it is nearlyimpossible for these systems to enforceUNIQUE constraints and foreign keys.

Programmers can work around this by issuing extra queries to confirman update is valid, but this requires a lot of extra work. This willnever be perfectly accurate–it may be impossible to avoid raceconditions when two clients simultaneously attempt conflicting updates.And it’s especially difficult with SimpleDB because SimpleDB doesn’tguarantee that a client sees all the recent updates to the data.

2. Inconsistency will provide a terrible user experience.

Speaking of inconsistency, it’s critical to shield users from
this property of SimpleDB.

SimpleDB is optimized for fast writes. Your API calls return as soonas the data is written to the SimpleDB service, but before it’sreplicated across all of the SimpleDB servers. If you issue any queriesbefore the data is propagated, you won’t necessarily see your mostrecent change.

When I save my changes in your web application, I expect that yoursystem will show me a consistent view of those changes. If you show methe data that’s in SimpleDB, my changes might not appear, and I’llprobably get confused. In fact, I will probably freak out,thinking that you lost my data. You can try to inform me about how thisworks (”It will take a few minutes for your changes to be visible…”)but that’s not easy for users to grasp.

3. Aggregate operations will require more coding.

SimpleDB does not support aggregate operations like joins,GROUP BY,SUM/AVERAGE functions, and sorting. You will need to implement these yourself.

Todd Hoff argues that this “suckiness” is a fair tradeoff:

SimpleDB shifts work out of the database and ontoprogrammers which is why the SimpleDB programming model sucks: itrequires a lot more programming to do simple things. I’ll argue howeverthat this is the kind of suckiness programmers like. Programmers likeproblems they can solve with more programming. We don’t even care howtwisted and inelegant the code is because we can make it work. And aslong as we can make it work we are happy.

I disagree. More boilerplate code distracts you from actually solving real users’ needs. Why reinvent theGROUP BY wheel when MySQL, PostgreSQL and Oracle have already perfected it?

4. Complicated reports, and ad hoc queries, will require a lot more coding.

In my experience, database use falls into three broad patterns: (1)standard queries and updates performed by your application’s users; (2)more complicated reports for users and internal staff; and (3) ad hocqueries for troubleshooting and system monitoring. SimpleDB may beoptimized for category 1, but categories 2 and 3 will be much moredifficult without SQL.

Complicated reports are probably the best application of the SQLlanguage. Because SQL is a declarative language, it’s incredibly easyto generate aggregate information about your data. In my previous jobs,our reports often required hundreds of lines of SQL to get the rightinformation out of the database. This is a lot of code, but it wasrequired to generate the data for our customers. Without access to SQL,your programmers will need to implement reports through imperativestatements, which will exponentially increase the development time.

Ad hoc queries are even worse: they’re usually simpler, but they’realways changing. An RDBMS expert can often write an ad hoc SQL query asfast as the marketing department can explain what they need. Using animperative programming language to write these queries would destroyyour developers’ productivity.

5. Aggregate operations will be much slower if you don’t use an RDBMS.

RDBMSes are highly optimized for performing aggregate operations across huge volumes of data. Fast algorithms like the hash join, merge join, and indexed binary searchhave been around for 20 years or more. SimpleDB and the GoogleDatastore return datasets which are more like objects than traditionaldatabase rows. It’s unlikely that you’ll be able to process this datawith anything other than nested loops,especially if your programmers aren’t database algorithm experts.Nested loop algorithms are considerably slower than the others.

Even if you’re the 31337est database expert and enjoy writing theseoperations in your business objects, there’s another performance factorto consider. In order for your application server to handle aggregateoperations, you will need a copy of all the relevant data on theapplication server. Rather than downloading a singleSUMfunction result from the database, your application server will need todownload all the data required to calculate the sum. This extra datatransfer will add considerable latency when you’re dealing withthousands or millions of records.

6. Data import, export, and backup will be slow and difficult.

Oracle, MySQL and other RDBMSes include advanced tools to performlarge-scale data import and export operations. These tools have alsobeen refined for 20 years or so, and can process millions of rows perminute. There are no such tools for key-value data stores, becausethese products are so new.

When you’re processing millions of records, network latency makes abig impact. Most of these services perform a remote procedure call foreach record inserted; some even limit you to querying onerecord per remote call. On the Internet, round-trip latency is usually20-40ms, which may slow you down to fewer than 2,000 rows per minute.(You can process more quickly via multi-threading, but again, thatrequires you to write a lot more infrastructure code.)

7. SimpleDB isn’t that fast.

Todd Hoff’s article referenced a SimpleDB performance test whichfound that 10 record IDs could be retrieved in 141ms from a1,000-record table; in 266ms from a 100,000-record table; and in 433msfrom a 1,000,000-record table.

Compared to relational databases, this is pretty slow.

If you want your web application to be responsive, you need yourdatabase queries to operate much faster than this. 20ms responses wouldbe more in line with conventional databases. If you perform 3 SimpleDBqueries in series, your web app will take about 1.5 seconds for thatoperation, and users will notice when the app is that slow. Many webapplications actually make dozens of queries per request.

Further, tables with a million records aren’t large enough to needsignificant scalability. A million-record table is probably smallenough to fit entirely in RAM; surely its indexes could fit in RAM. Thereal test of SimpleDB scalability is its performance on a table with100 million or 1 billion records.

8. Relational databases are scalable, even with massive data sets.

The world’s largest companies all use giant relational databases,and they’ve been able to make this work. The world’s largest websitesuse relational databases, and they’ve also been able to scalesuccessfully. Facebook and LiveJournal use MySQL; MySpace uses Microsoft SQL Server; Salesforce.com uses Oracle. When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.

We all expect Oracle to scale if we pay them enough money, but evenfree databases have made significant advances to prevent the databaseserver from becoming a bottleneck. The first line of defense is caching–eliminating repetitive queries can offload massive amount of processing. Beyond caching, there are free clustering engines which let you balance your database requests around a few servers in a cluster.

Without a complicated clustering setup, your data can usually bepartitioned across multiple servers to eliminate the single-serverbottleneck. Lest you think I’m ragging on Todd Hoff, he’s written a nice overview of sharding, one way of designing a federated database to get around the bottleneck.

9. Super-scalability is overrated. Slowing the pace of your product development is even worse.

Time-to-market is a critical factor for most software products. Ifyou’re writing internal software for a business, budgetary concerns areequally critical. You can workaround most of the drawbacks I’veidentified above, but it will cost you time and money.

More importantly, all these technical workarounds distract you fromaddressing the real needs of your customers. If you don’t focus on making something people want, it doesn’t matter how scalable your database is, because you won’t have any customers to fill up the database.

The hype around the new data stores seems to be a case of prematureoptimization, yet we all know Donald Knuth’s famous quote, “Prematureoptimization is the root of all evil.” Why not wait and addresssuper-scalability once you’ve created a super product and havegenerated super cash flow?

10. SimpleDB is useful, but only in certain contexts.

Everyone’s assuming that SimpleDB was designed to be ageneral-purpose replacement for OLTP database servers. I don’t think itwas ever intended for that purpose. SimpleDB’s architecture is similarto Dynamo,Amazon’s internal “highly-available key-value store.” One of its maindistinguishing features is the flexible schema: the ability to addcustom fields to individual records, and to store multiple values ineach field.

If you’re working with “semi-structured” data, then this is actuallyincredibly useful. For example, it’s an awesome way to persist webapplication sessions. You can avoid the overhead of marshaling theobject-oriented session data into columns and rows, and many of thedrawbacks above don’t apply because you don’t generally query sessionslike you query more typical relational data.

Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’tbad products. But we do them a disservice when we construe them to bereplacements for general-purpose databases. Used carefully, they canhelp your organization. But used indiscriminately, you’ll create a lotmore work for your programmers and you’ll make your application performeven worse.