Bytepawn: Marton Trencseni on Software, Systems and other Ideas.

Readings in Distributed Systems

An expanding list of papers on Distributed Systems. If you think something important is missing from this page, please email me.

 

I. The Google Papers

 

A complete list of papers written by Googlers is here. The 5 papers below describe the core of their MapReduce/GoogleFS/BigTable system.

1. The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

 

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

 

2. Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
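The paper describes the data model as a sparse, distributed, persistent, multidimensional sorted map indexed by row key, column name, and timestamp. A toy in-memory sketch of that map shape, just to make the indexing concrete (the class and method names are illustrative, not Bigtable's API):

```python
# A toy, in-memory illustration of the Bigtable data model:
# (row key, column family:qualifier, timestamp) -> value.
# This only shows the shape of the map, not Bigtable itself.
from collections import defaultdict

class ToyTable:
    def __init__(self):
        # row -> column -> {timestamp: value}; rows are kept sorted on scan
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None."""
        cells = self._rows.get(row, {}).get(column)
        if not cells:
            return None
        return cells[max(cells)]

    def scan(self, start_row, end_row):
        """Yield (row, column, latest value) for rows in [start_row, end_row)."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column, cells in self._rows[row].items():
                    yield row, column, cells[max(cells)]

t = ToyTable()
t.put("com.cnn.www", "contents:", timestamp=3, value="<html>...</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", timestamp=9, value="CNN")
print(t.get("com.cnn.www", "contents:"))
```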

 

3. The Chubby Lock Service for Loosely-Coupled Distributed Systems

Mike Burrows
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

 

4. Paxos Made Live - An Engineering Perspective

Tushar Chandra, Robert Griesemer, and Joshua Redstone
We describe our experience building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.

 

5. MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

 

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
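The canonical example from the paper is word counting: map emits (word, 1) for every word, and reduce sums the counts for each word. A small single-process sketch of that flow, where grouping by key stands in for the shuffle the real runtime performs across machines:

```python
# Single-process sketch of the MapReduce word-count example:
# map emits (word, 1) pairs, the "shuffle" groups values by key,
# and reduce sums the values for each key.
from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)   # shuffle: group values by key
    # Reduce phase
    output = {}
    for k, values in intermediate.items():
        for out_k, out_v in reduce_fn(k, values):
            output[out_k] = out_v
    return output

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'the': 2, 'quick': 1, ...}
```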

 

II. Distributed Filesystems

 

Also see the Google Filesystem.

1. Petal: Distributed Virtual Disks

Edward K. Lee and Chandramohan A. Thekkath
The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate this ideal in practice through a novel combination of features. Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks. To a Petal client, this collection appears as a highly available block-level storage system that provides large abstract containers called virtual disks. A virtual disk is globally accessible to all Petal clients on the network. A client can create a virtual disk on demand to tap the entire capacity and performance of the underlying physical resources. Furthermore, additional resources, such as servers and disks, can be automatically incorporated into Petal. We have an initial Petal prototype consisting of four 225 MHz DEC 3000/700 workstations running Digital Unix and connected by a 155 Mbit/s ATM network. The prototype provides clients with virtual disks that tolerate and recover from disk, server, and network failures. Latency is comparable to a locally attached disk, and throughput scales with the number of servers. The prototype can achieve I/O rates of up to 3150 requests/sec and bandwidth up to 43.1 Mbytes/sec.

 

2. Frangipani: A Scalable Distributed File System

Chandramohan A. Thekkath, Timothy Mann, Edward K. Lee
The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added. Frangipani is a new file system that approximates this ideal, yet was relatively easy to build because of its two-layer structure. The lower layer is Petal (described in an earlier paper), a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disks. In the upper layer, multiple machines run the same Frangipani file system code on top of a shared Petal virtual disk, using a distributed lock service to ensure coherence. Frangipani is meant to run in a cluster of machines that are under a common administration and can communicate securely. Thus the machines trust one another and the shared virtual disk approach is practical. Of course, a Frangipani file system can be exported to untrusted machines using ordinary network file access protocols. We have implemented Frangipani on a collection of Alphas running DIGITAL Unix 4.0. Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added.

 

3. GPFS: A shared-disk file system for large computing clusters

Frank Schmuck, Roger Haskin
GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

 

4. Replication in the Harp File System

Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, Michael Williams
This paper describes the design and implementation of the Harp file system. Harp is a replicated Unix file system accessible via the VFS interface. It provides highly available and reliable storage for files and guarantees that file operations are executed atomically in spite of concurrency and failures. It uses a novel variation of the primary copy replication technique that provides good performance because it allows us to trade disk accesses for network communication. Harp is intended to be used within a file service in a distributed network; in our current implementation, it is accessed via NFS. Preliminary performance results indicate that Harp provides equal or better response time and system capacity than an unreplicated implementation of NFS that uses Unix files directly.

Harp uses the primary copy replication technique [1, 26, 27]. In this method, client calls are directed to a single primary server, which communicates with other backup servers and waits for them to respond before replying to the client. The system masks failures by performing a failover algorithm in which an inaccessible server is removed from service. When a primary performs an operation, it must inform enough backups to guarantee that the effects of that operation will survive all subsequent failovers. Harp is one of the first implementations of a primary copy scheme that runs on conventional hardware. It has some novel features that allow it to perform well. The key performance issues are how to provide quick response for user operations and how to provide good system capacity (roughly, the number of operations the system can handle in some time period while still providing good response time).
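A toy sketch of the primary-copy write path described above: the client's operation goes to the primary, which forwards it to the backups and only applies it and replies once enough of them have acknowledged it. All class and method names here are illustrative, not Harp's.

```python
# Toy illustration of primary-copy replication: the primary applies a write
# only after enough backups have acknowledged it, so the write survives a
# failover to any of those backups. Names are illustrative, not Harp's.
class Backup:
    def __init__(self):
        self.log = []

    def replicate(self, op):
        self.log.append(op)   # a real backup would also persist the op
        return True           # acknowledgement

class Primary:
    def __init__(self, backups, quorum):
        self.backups = backups
        self.quorum = quorum  # how many acks are needed before replying
        self.log = []

    def client_write(self, op):
        acks = sum(1 for b in self.backups if b.replicate(op))
        if acks < self.quorum:
            raise RuntimeError("not enough backups reachable")
        self.log.append(op)   # apply locally, then reply to the client
        return "ok"

primary = Primary(backups=[Backup(), Backup()], quorum=2)
print(primary.client_write(("create", "/tmp/f")))
```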

 

5. Swift: Using distributed disk striping to provide high I/O data rates

Luis-felipe Cabrera, Darrell D. E. Long
We present an I/O architecture, called Swift, that addresses the problem of data rate mismatches between the requirements of an application, storage devices, and the interconnection medium. The goal of Swift is to support high data rates in general purpose distributed systems. Swift uses a high-speed interconnection medium to provide high data rate transfers by using multiple slower storage devices in parallel. It scales well when using multiple storage devices and interconnections, and can use any appropriate storage technology, including high-performance devices such as disk arrays. To address the problem of partial failures, Swift stores data redundantly. Using the UNIX operating system, we have constructed a simplified prototype of the Swift architecture. The prototype provides data rates that are significantly faster than access to the local SCSI disk, limited by the capacity of a single Ethernet segment, or in the case of multiple Ethernet segments by the ability of the client to drive them. We have constructed a simulation model to demonstrate how the Swift architecture can exploit advances in processor, communication and storage technology. We consider the effects of processor speed, interconnection capacity, and multiple storage agents on the utilization of the components and the data rate of the system. We show that the data rates scale well in the number of storage devices, and that by replacing the most highly stressed components by more powerful ones the data rates of the entire system increase significantly.
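A minimal sketch of the striping idea behind Swift: logical blocks are spread round-robin over several storage servers so that a large transfer keeps all of them busy in parallel. The layout below is only illustrative and ignores the redundancy Swift adds for partial failures.

```python
# Round-robin striping of logical blocks across several storage devices,
# the basic mechanism for aggregating the bandwidth of slower devices.
# This toy version uses in-memory dicts as "devices" and has no redundancy.
class StripedStore:
    def __init__(self, devices, block_size=64 * 1024):
        self.devices = devices          # list of dict-like "devices"
        self.block_size = block_size

    def _locate(self, block_no):
        dev = self.devices[block_no % len(self.devices)]
        local_block = block_no // len(self.devices)
        return dev, local_block

    def write_block(self, block_no, data):
        dev, local_block = self._locate(block_no)
        dev[local_block] = data

    def read_block(self, block_no):
        dev, local_block = self._locate(block_no)
        return dev[local_block]

store = StripedStore(devices=[{}, {}, {}, {}])
for i in range(8):                       # blocks 0..7 land on devices 0..3, twice
    store.write_block(i, b"x" * store.block_size)
print(len(store.devices[0]))             # -> 2 blocks on the first device
```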

 

6. Ceph: a scalable, high-performance distributed file system

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

 

7. XtreemFS - a case for object-based storage in Grid data management

Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, Matthias Hess, Jesus Malo, Jonathan Marti, Eugenio Cesario
In today's Grids, files are usually managed by Grid data management systems that are superimposed on existing file and storage systems. In this position paper, we analyze this predominant approach and argue that object-based file systems can be an alternative when adapted to the characteristics of a Grid environment. We describe how we are solving the challenge of extending the object-based storage architecture for the Grid in XtreemFS, an object-based file system for federated infrastructures.

 

8. Andrew FS

AFS is an ancient distributed filesystem, OpenAFS being an open-source implementation. I couldn't find a good overview paper about either.

 

9. Lustre

There's a bunch of information at arch.lustre.org about the architecture, but I could not find a good overview paper.

 

10. GlusterFS

GlusterFS is a new semi-serious open-source distributed filesystem. Looking at the wiki, it seems to me that the guarantees and semantics of the system are not clear, especially given that certain modules can be stacked in different order, both on server and client nodes.

 

11. Sinfonia: A new paradigm for building scalable distributed systems

Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
We propose a new paradigm for building scalable distributed systems. Our approach does not require dealing with message-passing protocols—a major complication in existing distributed systems. Instead, developers just design and manipulate data structures within our service called Sinfonia. Sinfonia keeps data for applications on a set of memory nodes, each exporting a linear address space. At the core of Sinfonia is a novel minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures. Using Sinfonia, we implemented two very different and complex applications in a few months: a cluster file system and a group communication service. Our implementations perform well and scale to hundreds of machines.
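A minitransaction, as the paper describes it, bundles compare items, read items, and write items: if all the comparisons match, the reads are returned and the writes applied atomically. A toy single-node sketch of those semantics (names and the lock-acquire example are illustrative; real minitransactions span many memory nodes via a two-phase protocol):

```python
# Toy, single-node illustration of a Sinfonia-style minitransaction:
# if all compare items match, the read items are returned and the write
# items are applied, all atomically. Names here are illustrative.
class MemoryNode:
    def __init__(self):
        self.mem = {}   # address -> value

    def exec_minitransaction(self, compare_items, read_items, write_items):
        # compare phase: abort if any comparison fails
        for addr, expected in compare_items:
            if self.mem.get(addr) != expected:
                return False, {}
        # read + write happen together, atomically w.r.t. other calls
        reads = {addr: self.mem.get(addr) for addr in read_items}
        for addr, value in write_items:
            self.mem[addr] = value
        return True, reads

node = MemoryNode()
node.mem["lock:/f"] = "free"
ok, _ = node.exec_minitransaction(
    compare_items=[("lock:/f", "free")],           # only if the lock is free...
    read_items=[],
    write_items=[("lock:/f", "held-by-client1")])  # ...acquire it
print(ok)  # True
```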

 

III. Non-relational Distributed Databases

 

Also see Google's Bigtable above. This section is a work-in-progress.

1. Dynamo: Amazon’s Highly Available Key-value Store

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.

 

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
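The object versioning mentioned above is built on vector clocks: each version carries a per-node counter, and two versions that neither dominates are concurrent and must be reconciled by the application. A small sketch of just that comparison (the put/get plumbing is omitted and names are illustrative):

```python
# Vector-clock comparison, the mechanism behind Dynamo's object versioning:
# each version carries a {node: counter} clock, and two versions conflict
# (application-assisted reconciliation) when neither clock dominates.
def descends(a, b):
    """True if clock a is causally equal to or newer than clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a newer"
    if descends(b, a):
        return "b newer"
    return "conflict"   # concurrent updates: the application must merge them

v1 = {"nodeA": 1}                 # written through node A
v2 = {"nodeA": 1, "nodeB": 1}     # later update routed through node B
v3 = {"nodeA": 2}                 # concurrent update through node A again
print(compare(v2, v1))   # 'a newer'  -- v2 supersedes v1
print(compare(v2, v3))   # 'conflict' -- e.g. two divergent shopping carts
```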

 

2. Facebook's Cassandra

There's no paper about it, but it's open-source, written in Java. Here's the Wikipedia entry.

 

3. CouchDB

There's no paper about it, but it's open-source, written in Erlang. Here's the Wikipedia entry.

 

4. Project Voldemort

Voldemort is a distributed key-value storage system similar to Amazon's Dynamo, written in Java, available as open-source. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient. As their page says, it is still a new system which has rough edges, bad error messages, and probably plenty of uncaught bugs.

 

IV. The Lamport Papers

 

Lamport's homepage with a complete list of papers is here.

1. Paxos made simple

L. Lamport
The Paxos algorithm, when presented in plain English, is very simple.
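As a companion to the paper, here is a toy, message-free sketch of the single-decree (synod) algorithm it presents: a proposer collects promises from a majority of acceptors, adopts the highest-numbered value any of them has already accepted, and then asks the same majority to accept. Networking, retries and failures are omitted; this is only a sketch of the two phases, not a usable implementation.

```python
# Toy sketch of single-decree Paxos: phase 1 (prepare/promise) followed by
# phase 2 (accept). Acceptors are plain objects; there is no real messaging.
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted = None     # (number, value) of last accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: prepare / promise
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                       # no majority; retry with a higher n
    # Adopt the highest-numbered value already accepted, if any
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: accept
    acks = sum(1 for a in acceptors if a.accept(n, value))
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="A"))   # 'A' is chosen
print(propose(acceptors, n=2, value="B"))   # still 'A': later proposals adopt it
```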

 

2. Reaching Agreement in the Presence of Faults

M. Pease, R. Shostak, L. Lamport
The problem addressed here concerns a set of isolated processors, some unknown subset of which may be faulty, that communicate only by means of two-party messages. Each nonfaulty processor has a private value of information that must be communicated to each other nonfaulty processor. Nonfaulty processors always communicate honestly, whereas faulty processors may lie. The problem is to devise an algorithm in which processors communicate their own values and relay values received from others that allows each nonfaulty processor to infer a value for each other processor. The value inferred for a nonfaulty processor must be that processor's private value, and the value inferred for a faulty one must be consistent with the corresponding value inferred by each other nonfaulty processor. It is shown that the problem is solvable for, and only for, n ≥ 3m + 1, where m is the number of faulty processors and n is the total number. It is also shown that if faulty processors can refuse to pass on information but cannot falsely relay information, the problem is solvable for arbitrary n ≥ m ≥ 0. This weaker assumption can be approximated in practice using cryptographic methods.

 

3. Time, clocks, and the ordering of events in a distributed system

L. Lamport
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
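A minimal sketch of the logical clock rules from the paper: each process increments its counter on every local event, stamps outgoing messages with it, and on receipt advances to max(local, received) + 1. The resulting timestamps respect the happened-before partial order; breaking ties by process id gives the total order the paper uses.

```python
# Minimal Lamport logical clock sketch: increment on local events and sends,
# and take max(local, received) + 1 on receives.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock            # timestamp carried on the message

    def receive(self, msg_timestamp):
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock

p1, p2 = Process(1), Process(2)
p1.local_event()                     # p1: 1
ts = p1.send()                       # p1: 2, message stamped 2
p2.local_event()                     # p2: 1
p2.receive(ts)                       # p2: max(1, 2) + 1 = 3
print(p1.clock, p2.clock)            # 2 3
```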

 

4. The Part-Time Parliament

L. Lamport
Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.

 

5. Distributed snapshots: determining global states of distributed systems

K.M. Chandy, L. Lamport
This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
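A sketch of one process's role in the Chandy-Lamport algorithm: on the first marker it sees (or when it initiates the snapshot), it records its local state and sends markers on all outgoing channels; afterwards it records every message arriving on an incoming channel until that channel's own marker arrives. Delivery of messages and markers between processes is assumed to be handled by the surrounding system; names are illustrative.

```python
# One process's view of the Chandy-Lamport distributed snapshot algorithm.
class SnapshotProcess:
    def __init__(self, state, in_channels, out_channels, send_marker):
        self.state = state
        self.recorded_state = None
        self.recording = {c: False for c in in_channels}   # channel -> recording?
        self.channel_state = {c: [] for c in in_channels}  # in-flight messages
        self.out_channels = out_channels
        self.send_marker = send_marker                     # callback(channel)

    def _record_local(self):
        self.recorded_state = self.state
        for c in self.recording:
            self.recording[c] = True       # start recording incoming channels
        for c in self.out_channels:
            self.send_marker(c)            # propagate the snapshot

    def initiate_snapshot(self):
        self._record_local()

    def on_marker(self, channel):
        if self.recorded_state is None:
            self._record_local()           # first marker: record state now
        self.recording[channel] = False    # this channel's state is complete

    def on_message(self, channel, msg):
        if self.recording.get(channel):
            self.channel_state[channel].append(msg)  # message was in flight
        # ...then process msg as part of the normal computation
```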

 

V. Implementation Issues

 

1. A scalable and explicit event delivery mechanism for UNIX

G. Banga, J.C. Mogul, P. Druschel
UNIX applications not wishing to block when doing I/O often use the select() system call to wait for events on multiple file descriptors. The select() mechanism works well for small-scale applications, but scales poorly as the number of file descriptors increases. Many modern applications, such as Internet servers, use hundreds or thousands of file descriptors, and suffer greatly from the poor scalability of select(). Previous work has shown that while the traditional implementation of select() can be improved, the poor scalability is inherent in the design. We present a new event-delivery mechanism, which allows the application to register interest in one or more sources of events, and to efficiently dequeue new events. We show that this mechanism, which requires only minor changes to applications, performs independently of the number of file descriptors.

 

2. A design framework for highly concurrent systems

Matt Welsh, Steven D. Gribble, Eric A. Brewer, David Culler

 

Building highly concurrent systems, such as large-scale Internet services, requires managing many information flows at once and maintaining peak throughput when demand exceeds resource availability. In addition, any platform supporting Internet services must provide high availability and be able to cope with burstiness of load. Many approaches to building concurrent systems have been proposed, which generally fall into the two categories of threaded and event-driven programming. We propose that threads and events are actually on the ends of a design spectrum, and that the best implementation strategy for these applications is somewhere in between. We present a general-purpose design framework for building highly concurrent systems, based on three design components — tasks, queues, and thread pools — which encapsulate the concurrency, performance, fault isolation, and software engineering benefits of both threads and events. We present a set of design patterns that can be applied to map an application onto an implementation using these components. In addition, we provide an analysis of several systems (including an Internet services platform and a highly available, distributed, persistent data store) constructed using our framework, demonstrating its benefit for building and reasoning about concurrent applications.

 

3. Comparing and evaluating epoll, select, and poll event mechanisms

Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag
This paper uses a high-performance, event-driven, HTTP server (the µserver) to compare the performance of the select, poll, and epoll event mechanisms. We subject the µserver to a variety of workloads that allow us to expose the relative strengths and weaknesses of each event mechanism.

 

Interestingly, initial results show that the select and poll event mechanisms perform comparably to the epoll event mechanism in the absence of idle connections. Profiling data shows a significant amount of time spent executing a large number of epoll_ctl system calls. As a result, we examine a variety of techniques for reducing epoll_ctl overhead including edge-triggered notification, and introducing a new system call (epoll_ctlv) that aggregates several epoll_ctl calls into a single call. Our experiments indicate that although these techniques are successful at reducing epoll_ctl overhead, they only improve performance slightly.
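To see the register/wait/dequeue pattern these papers discuss, here is a minimal level-triggered echo loop using Python's select module, which exposes epoll on Linux (this is not the paper's µserver, just the shape of the API; registering with select.EPOLLIN | select.EPOLLET would switch to the edge-triggered notification mentioned above).

```python
# Minimal epoll-based echo server sketch (Linux only): each epoll.register()
# corresponds to an epoll_ctl call, and epoll.poll() dequeues ready events.
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8080))
listener.listen(128)
listener.setblocking(False)

ep = select.epoll()
ep.register(listener.fileno(), select.EPOLLIN)   # one epoll_ctl call per fd
conns = {}

while True:
    for fd, events in ep.poll(1.0):              # wait for ready descriptors
        if fd == listener.fileno():
            conn, _ = listener.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].send(data)              # echo back
            else:                                 # peer closed the connection
                ep.unregister(fd)
                conns.pop(fd).close()
```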

 

4. Asynchronous I/O support in Linux 2.5

Suparna Bhattacharya, Steven Pratt, Badari Pulavarty, Janet Morgan
This paper describes the Asynchronous I/O (AIO) support in the Linux 2.5 kernel, additional functionality available as patchsets, and plans for further improvements. More specifically, the following topics are treated in some depth:

 

• Asynchronous filesystem I/O
• Asynchronous direct I/O
• Asynchronous vector I/O

As of Linux 2.5, AIO falls into the common mainline path underlying all I/O operations, whether synchronous or asynchronous. The implications of this, and other significant ways in which the design for AIO in 2.5 differs from the patches that existed for 2.4, are explored as part of the discussion.

 

5. Linux AIO performance and robustness for enterprise workloads

Suparna Bhattacharya, John Tran, Mike Sullivan, Chris Mason
In this paper we address some of the issues identified during the development and stabilization of Asynchronous I/O (AIO) on Linux 2.6.

 

We start by describing improvements made to optimize the throughput of streaming buffered filesystem AIO for microbenchmark runs. Next, we discuss certain tricky issues in ensuring data integrity between AIO Direct I/O (DIO) and buffered I/O, and take a deeper look at synchronized I/O guarantees, concurrent I/O, write-ordering issues and the improvements resulting from radix-tree based writeback changes in the Linux VFS.

We then investigate the results of using Linux 2.6 filesystem AIO on the performance metrics for certain enterprise database workloads which are expected to benefit from AIO, and mention a few tips on optimizing AIO for such workloads. Finally, we briefly discuss the issues around workloads that need to combine asynchronous disk I/O and network I/O.

 

6. Books written by the late W. Richard Stevens

Here's a list of books from Amazon, which includes classics like TCP/IP Illustrated Volumes 1-3, Advanced Programming in the UNIX Environment and Unix Network Programming.