Google File System v2, part 1

by Robin Harris on Monday, 17 August, 2009

A couple of years ago at the first Seattle Conference on Scalability, Google’s Jeffrey Dean remarked that the company wanted 100x more scalability. Unsurprising given the rapid growth of the web. But there was more to it than that: GFS – the Google File System – was running out of scalability.

The culprit: the single-master architecture of GFS. As described in the StorageMojo post Google File System Eval: Pt. 1:

A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. . . .

If, like me, you thought bottleneck/SPOF when you saw the single master, you would, like me, have been several steps behind the architects. The master only tells clients (in tiny multibyte messages) which chunkservers have needed chunks. Clients then interact directly with chunkservers for most subsequent operations. Now grok one of the big advantages of a large chunk size: clients don’t need much interaction with masters to gain access to a lot of data.
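To make that division of labor concrete, here is a minimal sketch of the read path in Python. The class and method names are invented for illustration, not Google’s client API; the point is that the master answers only the tiny “where is chunk N of file X?” question while the bulk data moves directly between client and chunkserver.

    # Hypothetical sketch of the GFS read path; names and signatures are
    # illustrative, not Google's actual client library.
    CHUNK_SIZE = 64 * 2**20  # GFS's default 64 MB chunk

    class Master:
        """Answers metadata lookups only; never touches file data."""
        def __init__(self):
            # (path, chunk index) -> (chunk handle, [chunkserver addresses])
            self.chunk_index = {}

        def lookup(self, path, chunk_index):
            return self.chunk_index[(path, chunk_index)]

    class ChunkServer:
        """Holds chunk data and serves reads directly to clients."""
        def __init__(self):
            self.chunks = {}  # chunk handle -> bytes

        def read_chunk(self, handle, offset, length):
            return self.chunks[handle][offset:offset + length]

    class Client:
        def __init__(self, master, chunkservers):
            self.master = master
            self.chunkservers = chunkservers  # address -> ChunkServer
            self.location_cache = {}          # cached master answers

        def read(self, path, offset, length):
            """Read within a single chunk; boundary-crossing reads omitted."""
            key = (path, offset // CHUNK_SIZE)
            if key not in self.location_cache:
                # One tiny metadata round trip to the master...
                self.location_cache[key] = self.master.lookup(*key)
            handle, replicas = self.location_cache[key]
            # ...then the bulk transfer goes straight to a chunkserver replica.
            return self.chunkservers[replicas[0]].read_chunk(
                handle, offset % CHUNK_SIZE, length)

    # Tiny demo: one file, one chunk, one replica.
    cs = ChunkServer()
    cs.chunks["h1"] = b"hello, gfs"
    m = Master()
    m.chunk_index[("/logs/a", 0)] = ("h1", ["cs0"])
    print(Client(m, {"cs0": cs}).read("/logs/a", 0, 5))  # b'hello'

Because locations are cached, a client streaming a large file touches the master roughly once per 64 MB, which is exactly the advantage of the big chunk size.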

The master stores — in memory for speed — three major types of metadata (sketched in code after the list):

  • File and chunk names [or namespaces in geekspeak]
  • Mapping from files to chunks, i.e. the chunks that make up each file
  • Locations of each chunk’s replicas
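A rough sketch of those three structures as plain in-memory maps follows; the class and field names are illustrative assumptions, and the real master’s operation log and checkpoints are omitted.

    # Illustrative layout for the three metadata types above; the real
    # master also keeps an operation log and checkpoints for durability.
    class MasterMetadata:
        def __init__(self):
            # 1. File and chunk namespaces: the set of known path names.
            self.namespace = set()
            # 2. File -> chunks: the ordered handles that make up each file.
            self.file_to_chunks = {}     # path -> [chunk handle, ...]
            # 3. Replica locations, learned from chunkserver heartbeats
            #    rather than persisted.
            self.chunk_locations = {}    # chunk handle -> [server address, ...]

        def create_file(self, path):
            self.namespace.add(path)
            self.file_to_chunks[path] = []

        def add_chunk(self, path, handle, replicas):
            self.file_to_chunks[path].append(handle)
            self.chunk_locations[handle] = list(replicas)

Every one of these tables grows with the number of files and chunks, so the metadata volume, and the time to scan it, tracks the size of the underlying storage; that is exactly the scaling problem Quinlan describes below.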

Not as scalable as the Internet
As it turned out though, the single master did become a bottleneck. As Google engineer Sean Quinlan explained in a recent ACM Queue interview:

The decision to go with a single master was actually one of the very first decisions, mostly just to simplify the overall design problem. That is, building a distributed master right from the outset was deemed too difficult and would take too much time. . . .

. . . in sketching out the use cases they anticipated, it didn’t seem the single-master design would cause much of a problem. The scale they were thinking about back then was framed in terms of hundreds of terabytes and a few million files. . . .

Problems started to occur once the size of the underlying storage increased. Going from a few hundred terabytes up to petabytes, and then up to tens of petabytes, that really required a proportionate increase in the amount of metadata the master had to maintain. Also, operations such as scanning the metadata to look for recoveries all scaled linearly with the volume of data.

. . . this proved to be a bottleneck for the clients, even though the clients issue few metadata operations themselves—for example, a client talks to the master whenever it does an open. When you have thousands of clients all talking to the master at the same time, given that the master is capable of doing only a few thousand operations a second, the average client isn’t able to command all that many operations per second.

Also bear in mind that there are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests, and the master would be under a fair amount of duress. . . .

We ended up putting a fair amount of effort into tuning master performance, and it’s atypical of Google to put a lot of work into tuning any one particular binary. Generally, our approach is just to get things working reasonably well and then turn our focus to scalability—which usually works well in that you can generally get your performance back by scaling things. . . .

It could be argued that managing to get GFS ready for production in record time constituted a victory in its own right and that, by speeding Google to market, this ultimately contributed mightily to the company’s success. A team of three was responsible for all of that — for the core of GFS — and for the system being readied for deployment in less than a year.

But then came the price that so often befalls any successful system — that is, once the scale and use cases have had time to expand far beyond what anyone could have possibly imagined. In Google’s case, those pressures proved to be particularly intense.

Although organizations don’t make a habit of exchanging file-system statistics, it’s safe to assume that GFS is the largest file system in operation (in fact, that was probably true even before Google’s acquisition of YouTube). Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders of magnitude of growth, Google quickly zoomed right past that.

Google has now developed a distributed master system that scales to hundreds of masters, each capable of handling about 100 million files.
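Google hasn’t said how that distributed master routes requests, so the sketch below is purely an assumption for illustration: partition the file namespace deterministically, for example by hashing each path to one of N master shards, so any client can find the right master without a central lookup.

    # Hypothetical illustration of metadata sharding across many masters.
    # The routing rule (hash of the file path) is an assumption for this
    # sketch, not Google's published design.
    import hashlib

    class ShardedMasters:
        def __init__(self, masters):
            self.masters = masters  # one metadata server per namespace shard

        def shard_for(self, path):
            digest = hashlib.sha1(path.encode()).hexdigest()
            return self.masters[int(digest, 16) % len(self.masters)]

        def lookup(self, path, chunk_index):
            # Each shard carries only its slice of the namespace,
            # roughly 100 million files apiece per the figure above.
            return self.shard_for(path).lookup(path, chunk_index)

With a few hundred shards at roughly 100 million files each, the aggregate file count lands in the tens of billions, consistent with the scale quoted above.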

Application pressures
Not only did the system need to scale far more than the designers anticipated, but the number of apps GFS supported also grew. And not all of those apps – think Gmail – could use a 64MB chunk size efficiently.

Thus the need to handle a 1MB chunk size and the number of files associated with smaller chunk and file sizes. That’s where BigTable comes in as both a savior and a problem itself.
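Some back-of-envelope arithmetic shows why smaller chunks strain the single master. The original GFS paper cites less than 64 bytes of master metadata per chunk; the figures below are rough illustrations of that ratio, not measurements of Google’s system.

    # Rough arithmetic: chunk metadata the master must hold in RAM.
    # The 64-byte-per-chunk overhead is the figure from the GFS paper.
    def master_metadata_bytes(data_bytes, chunk_bytes, per_chunk_overhead=64):
        return (data_bytes / chunk_bytes) * per_chunk_overhead

    PB = 10**15
    for chunk_mb in (64, 1):
        overhead = master_metadata_bytes(10 * PB, chunk_mb * 2**20)
        print(f"{chunk_mb:>2} MB chunks over 10 PB -> "
              f"~{overhead / 2**30:,.0f} GiB of chunk metadata")

    # Dropping the chunk size from 64 MB to 1 MB multiplies the chunk
    # count, and hence this slice of master memory, by 64x, before even
    # counting the per-file metadata that many more small files bring.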

The StorageMojo take
I’ll continue this in a later post, but the moral of the story is obvious: Internet scale is unlike anything we’ve seen in computing before. Even guys at the epicenter with immense resources and a clean sheet underestimated the challenge.

Estimating exponential growth is a universal weakness for Homo sapiens. All the cloud infrastructure vendors and providers need to think long and hard about how they will manage the growth that at least some of them will have.

But as Sean notes, if over-engineering gets you to the market late, you may not have the scale problem you were planning (hoping?) for.

Continued next post

Courteous comments welcome, of course.


Robert Pearson Thursday, 10 September, 2009 at 6:41 am

These are two of your finest posts.

“Google File System v2, part 1”
“Google File System v2, part 2”

To my regret these articulate, salient posts went unanswered.
This tells me that the level of “Storage Problem Solving” is still an
art and may never be a science.

The original problem of “Access Density” was tied specifically to disk drives.
Your two posts highlight where the coming “Access Density” problem
will be joined.
When I started looking at bandwidth needs for “End-to-End Information on Demand (E2EIoD)” there were many other areas (bottlenecks, hot spots) restricting or impeding the flow of Information. I tried to identify the most important ones with the “Speed Limit of the Information Universe” concept.

To deal with the problem of the “fire fighting” IT mentality for bottlenecks and hotspots with $pounds and $pounds of $cure, an ounce of prevention would be cheaper and work better. The ounce of prevention would be to have Infrastructure Bandwidth costs added as a line item in the IT budget.
Two options:
1) How much bandwidth do you need? How much can you afford?
2) How much can you afford? How much do you need?