Readings in Distributed Systems
An expanding list of papers on Distributed Systems. If you think something important is missing from this page, please email me.
I. The Google Papers
A complete list of papers written by Googlers is here. The 5 papers below describe the core of their MapReduce/GoogleFS/BigTable system.
1. The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
2. Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
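The data model the abstract mentions is a sparse, sorted, multi-dimensional map indexed by (row key, column key, timestamp). A minimal single-machine sketch of just that versioned-cell lookup (ignoring tablets, SSTables, and column families; the row and column names are illustrative):

```python
# Toy model of Bigtable's map: (row, column, timestamp) -> value.
table = {}

def put(row, column, ts, value):
    table[(row, column, ts)] = value

def get_latest(row, column):
    # Bigtable returns the most recent version of a cell by timestamp.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 1, "<html>v1</html>")
put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(get_latest("com.cnn.www", "contents:"))  # <html>v2</html>
```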
3. The Chubby Lock Service for Loosely-Coupled Distributed Systems
Mike Burrows
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
4. Paxos Made Live - An Engineering Perspective
Tushar Chandra, Robert Griesemer, and Joshua Redstone
We describe our experience building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.
5. MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
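The map/reduce contract described above can be sketched in a few lines. This is the classic word-count example run in a single process; the shuffle step stands in for what the real runtime distributes across thousands of machines:

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc):
    # User-supplied map: emit an intermediate (word, 1) pair per word.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # User-supplied reduce: merge all values for one intermediate key.
    return key, sum(values)

def mapreduce(inputs):
    # "Shuffle": group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k, v in chain.from_iterable(map_fn(x) for x in inputs):
        groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["the cat sat", "the dog"])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'dog': 1}
```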
II. Distributed Filesystems
Also see the Google Filesystem.
1. Petal: Distributed Virtual Disks
Edward K. Lee and Chandramohan A. Thekkath
The ideal storage system is globally accessible, always available, provides unlimited performance and capacity for a large number of clients, and requires no management. This paper describes the design, implementation, and performance of Petal, a system that attempts to approximate this ideal in practice through a novel combination of features. Petal consists of a collection of network-connected servers that cooperatively manage a pool of physical disks. To a Petal client, this collection appears as a highly available block-level storage system that provides large abstract containers called virtual disks. A virtual disk is globally accessible to all Petal clients on the network. A client can create a virtual disk on demand to tap the entire capacity and performance of the underlying physical resources. Furthermore, additional resources, such as servers and disks, can be automatically incorporated into Petal. We have an initial Petal prototype consisting of four 225 MHz DEC 3000/700 workstations running Digital Unix and connected by a 155 Mbit/s ATM network. The prototype provides clients with virtual disks that tolerate and recover from disk, server, and network failures. Latency is comparable to a locally attached disk, and throughput scales with the number of servers. The prototype can achieve I/O rates of up to 3150 requests/sec and bandwidth up to 43.1 Mbytes/sec.
2. Frangipani: A Scalable Distributed File System
Chandramohan A. Thekkath, Timothy Mann, Edward K. Lee
The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added. Frangipani is a new file system that approximates this ideal, yet was relatively easy to build because of its two-layer structure. The lower layer is Petal (described in an earlier paper), a distributed storage service that provides incrementally scalable, highly available, automatically managed virtual disks. In the upper layer, multiple machines run the same Frangipani file system code on top of a shared Petal virtual disk, using a distributed lock service to ensure coherence. Frangipani is meant to run in a cluster of machines that are under a common administration and can communicate securely. Thus the machines trust one another and the shared virtual disk approach is practical. Of course, a Frangipani file system can be exported to untrusted machines using ordinary network file access protocols. We have implemented Frangipani on a collection of Alphas running DIGITAL Unix 4.0. Initial measurements indicate that Frangipani has excellent single-server performance and scales well as servers are added.
3. GPFS: A shared-disk file system for large computing clusters
Frank Schmuck, Roger Haskin
GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
4. Replication in the Harp File System
Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, Michael Williams
This paper describes the design and implementation of the Harp file system. Harp is a replicated Unix file system accessible via the VFS interface. It provides highly available and reliable storage for files and guarantees that file operations are executed atomically in spite of concurrency and failures. It uses a novel variation of the primary copy replication technique that provides good performance because it allows us to trade disk accesses for network communication. Harp is intended to be used within a file service in a distributed network; in our current implementation, it is accessed via NFS. Preliminary performance results indicate that Harp provides equal or better response time and system capacity than an unreplicated implementation of NFS that uses Unix files directly.
Harp uses the primary copy replication technique [1, 26, 27]. In this method, client calls are directed to a single primary server, which communicates with other backup servers and waits for them to respond before replying to the client. The system masks failures by performing a failover algorithm in which an inaccessible server is removed from service. When a primary performs an operation, it must inform enough backups to guarantee that the effects of that operation will survive all subsequent failovers. Harp is one of the first implementations of a primary copy scheme that runs on conventional hardware. It has some novel features that allow it to perform well. The key performance issues are how to provide quick response for user operations and how to provide good system capacity (roughly, the number of operations the system can handle in some time period while still providing good response time).
5. Swift: Using distributed disk striping to provide high I/O data rates
Luis-Felipe Cabrera, Darrell D. E. Long
We present an I/O architecture, called Swift, that addresses the problem of data rate mismatches between the requirements of an application, storage devices, and the interconnection medium. The goal of Swift is to support high data rates in general purpose distributed systems. Swift uses a high-speed interconnection medium to provide high data rate transfers by using multiple slower storage devices in parallel. It scales well when using multiple storage devices and interconnections, and can use any appropriate storage technology, including high-performance devices such as disk arrays. To address the problem of partial failures, Swift stores data redundantly. Using the UNIX operating system, we have constructed a simplified prototype of the Swift architecture. The prototype provides data rates that are significantly faster than access to the local SCSI disk, limited by the capacity of a single Ethernet segment, or in the case of multiple Ethernet segments by the ability of the client to drive them. We have constructed a simulation model to demonstrate how the Swift architecture can exploit advances in processor, communication and storage technology. We consider the effects of processor speed, interconnection capacity, and multiple storage agents on the utilization of the components and the data rate of the system. We show that the data rates scale well in the number of storage devices, and that by replacing the most highly stressed components by more powerful ones the data rates of the entire system increase significantly.
6. Ceph: a scalable, high-performance distributed file system
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
7. XtreemFS - a case for object-based storage in Grid data management
Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, Matthias Hess, Jesus Malo, Jonathan Marti, Eugenio Cesario
In today's Grids, files are usually managed by Grid data management systems that are superimposed on existing file and storage systems. In this position paper, we analyze this predominant approach and argue that object-based file systems can be an alternative when adapted to the characteristics of a Grid environment. We describe how we are solving the challenge of extending the object-based storage architecture for the Grid in XtreemFS, an object-based file system for federated infrastructures.
8. Andrew FS
AFS is an ancient distributed filesystem, OpenAFS being an open-source implementation. I couldn't find a good overview paper about either.
9. Lustre
There's a bunch of information at arch.lustre.org about the architecture, but I could not find a good overview paper.
10. GlusterFS
GlusterFS is a new semi-serious open-source distributed filesystem. Looking at the wiki, it seems to me that the guarantees and semantics of the system are not clear, especially given that certain modules can be stacked in different order, both on server and client nodes.
11. Sinfonia: A new paradigm for building scalable distributed systems
Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis
We propose a new paradigm for building scalable distributed systems. Our approach does not require dealing with message-passing protocols—a major complication in existing distributed systems. Instead, developers just design and manipulate data structures within our service called Sinfonia. Sinfonia keeps data for applications on a set of memory nodes, each exporting a linear address space. At the core of Sinfonia is a novel minitransaction primitive that enables efficient and consistent access to data, while hiding the complexities that arise from concurrency and failures. Using Sinfonia, we implemented two very different and complex applications in a few months: a cluster file system and a group communication service. Our implementations perform well and scale to hundreds of machines.
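A minitransaction, as the abstract describes it, bundles compare items (which guard the commit), read items (which are returned), and write items (which are applied atomically on commit). A single-process sketch of those semantics only; the real system runs this as a two-phase protocol across memory nodes, which this toy ignores:

```python
# Toy minitransaction over one in-memory "memory node" (a dict).
def minitransaction(store, compares, reads, writes):
    # Abort if any compare item does not match the stored value.
    if any(store.get(addr) != expected for addr, expected in compares):
        return None
    # Read items are returned; write items are then applied on commit.
    result = {addr: store.get(addr) for addr in reads}
    store.update(writes)
    return result

memnode = {"x": 1, "y": 2}
print(minitransaction(memnode, [("x", 1)], ["y"], {"y": 3}))  # {'y': 2}
print(memnode["y"])                                           # 3
print(minitransaction(memnode, [("x", 99)], [], {"y": 0}))    # None
```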
III. Non-relational Distributed Databases
Also see Google's Bigtable. This section is a work in progress.
1. Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
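The object versioning the abstract mentions is done with vector clocks: each version of an object carries one, and if one clock descends from the other the older version can be discarded; otherwise the versions are concurrent and the application must reconcile them. A minimal sketch (node names are made up):

```python
def advance(clock, node):
    # A write handled by `node` bumps that node's counter in the clock.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # a descends from b iff a has seen at least every event b has.
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(a, b):
    if descends(a, b):
        return a
    if descends(b, a):
        return b
    return None  # concurrent versions: surface both to the application

v1 = advance({}, "sx")    # write handled by node sx
v2 = advance(v1, "sx")    # overwrite at the same node
v3 = advance(v1, "sy")    # concurrent write handled by node sy
print(reconcile(v2, v1))  # {'sx': 2}
print(reconcile(v2, v3))  # None -> conflict
```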
2. Facebook's Cassandra
There's no paper about it, but it's open-source, written in Java. Here's the Wikipedia entry.
3. CouchDB
There's no paper about it, but it's open-source, written in Erlang. Here's the Wikipedia entry.
4. Project Voldemort
Voldemort is a distributed key-value storage system similar to Amazon's Dynamo, written in Java, available as open-source. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient. As their page says, it is still a new system which has rough edges, bad error messages, and probably plenty of uncaught bugs.
IV. The Lamport Papers
Lamport's homepage with a complete list of papers is here.
1. Paxos made simple
L. Lamport
The Paxos algorithm, when presented in plain English, is very simple.
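The paper's two phases can be condensed into a short sketch of single-decree Paxos: a proposer gathers promises from a majority of acceptors (phase 1), adopts any value already accepted under the highest proposal number, and then asks a majority to accept (phase 2). This synchronous, failure-free simulation is only an illustration of the rules, not a usable implementation:

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised
        self.accepted = None   # (number, value) of highest accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, v):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: collect promises from a majority.
    granted = [acc for ok, acc in (a.prepare(n) for a in acceptors) if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If some acceptor already accepted a value, we must propose the
    # value with the highest proposal number instead of our own.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask a majority to accept.
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
r1 = propose(acceptors, 1, "x")
r2 = propose(acceptors, 2, "y")   # must rediscover the chosen value
print(r1, r2)  # x x
```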
2. Reaching Agreement in the Presence of Faults
M. Pease, R. Shostak, L. Lamport
The problem addressed here concerns a set of isolated processors, some unknown subset of which may be faulty, that communicate only by means of two-party messages. Each nonfaulty processor has a private value of information that must be communicated to each other nonfaulty processor. Nonfaulty processors always communicate honestly, whereas faulty processors may lie. The problem is to devise an algorithm in which processors communicate their own values and relay values received from others that allows each nonfaulty processor to infer a value for each other processor. The value inferred for a nonfaulty processor must be that processor's private value, and the value inferred for a faulty one must be consistent with the corresponding value inferred by each other nonfaulty processor. It is shown that the problem is solvable for, and only for, n ≥ 3m + 1, where m is the number of faulty processors and n is the total number. It is also shown that if faulty processors can refuse to pass on information but cannot falsely relay information, the problem is solvable for arbitrary n ≥ m ≥ 0. This weaker assumption can be approximated in practice using cryptographic methods.
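The n ≥ 3m + 1 bound can be made concrete with a toy simulation of the recursive "oral messages" algorithm OM(m) from the companion Byzantine Generals paper. Values are booleans; a traitorous commander sends conflicting values, and each lieutenant decides by majority over what the others relay. The traitor's lying strategy here is one arbitrary choice among many:

```python
def om(commander, lieutenants, value, m, traitors):
    # Commander sends a value to each lieutenant; a traitor sends
    # conflicting values (here: alternating True/False).
    received = {}
    for idx, lt in enumerate(lieutenants):
        received[lt] = (idx % 2 == 0) if commander in traitors else value
    if m == 0:
        return received  # base case: use the value received
    decision = {}
    for lt in lieutenants:
        votes = [received[lt]]
        for other in lieutenants:
            if other == lt:
                continue
            # `other` relays what it received, acting as commander
            # in OM(m - 1) among the remaining lieutenants.
            rest = [x for x in lieutenants if x != other]
            votes.append(om(other, rest, received[other], m - 1, traitors)[lt])
        decision[lt] = sum(votes) * 2 > len(votes)  # majority vote
    return decision

# n = 4, m = 1 satisfies n >= 3m + 1.
d1 = om("C", ["L1", "L2", "L3"], True, 1, traitors={"L3"})
print(d1["L1"], d1["L2"])   # loyal lieutenants follow the loyal commander

d2 = om("C", ["L1", "L2", "L3"], True, 1, traitors={"C"})
print(set(d2.values()))     # loyal lieutenants still agree with each other
```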
3. Time, clocks, and the ordering of events in a distributed system
L. Lamport
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
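The logical clock rules are small enough to sketch directly: increment on every local event, attach the clock to each message sent, and on receipt set the clock to one more than the maximum of the local clock and the message timestamp:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as an event; the timestamp travels on the message.
        self.time += 1
        return self.time

    def receive(self, ts):
        # Receive rule: clock := max(local, message timestamp) + 1.
        self.time = max(self.time, ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()          # a: 1
ts = a.send()     # a: 2, message carries timestamp 2
b.tick()          # b: 1
b.receive(ts)     # b: max(1, 2) + 1 = 3, so send precedes receive
print(a.time, b.time)  # 2 3
```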
4. The Part-Time Parliament
L. Lamport
Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.
5. Distributed snapshots: determining global states of distributed systems
K. M. Chandy, L. Lamport
This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
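The marker rules of the algorithm can be walked through by hand for two processes, p and q, connected by FIFO channels. The initiator records its own state and sends a marker on each outgoing channel; a process receiving its first marker records its state and echoes markers; messages that arrive on a channel between recording and the marker become that channel's state. A hand-stepped sketch with one message in flight:

```python
from collections import deque

state = {"p": 10, "q": 20}
chan = {("q", "p"): deque(), ("p", "q"): deque()}  # FIFO channels

snapshot = {}      # recorded process states
chan_record = {}   # recorded in-flight messages per channel

# q has sent an application message that is still in flight.
chan[("q", "p")].append(("MSG", 5))

# p initiates: record own state, send a marker on every outgoing channel,
# and start recording every incoming channel.
snapshot["p"] = state["p"]
chan[("p", "q")].append(("MARKER", None))
recording = {("q", "p"): []}

# q receives its first marker on (p, q): record state, record (p, q) as
# empty, and send a marker back.
kind, _ = chan[("p", "q")].popleft()
assert kind == "MARKER"
snapshot["q"] = state["q"]
chan_record[("p", "q")] = []
chan[("q", "p")].append(("MARKER", None))

# p drains (q, p): messages arriving before the marker belong to the
# channel state; the marker ends recording on that channel.
while True:
    kind, payload = chan[("q", "p")].popleft()
    if kind == "MARKER":
        chan_record[("q", "p")] = recording.pop(("q", "p"))
        break
    recording[("q", "p")].append(payload)

print(snapshot)     # {'p': 10, 'q': 20}
print(chan_record)  # {('p', 'q'): [], ('q', 'p'): [5]}
```

The in-flight message is captured as channel state rather than lost or double-counted, which is exactly what makes the recorded global state consistent.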
V. Implementation Issues
1. A scalable and explicit event delivery mechanism for UNIX
G. Banga, J. C. Mogul, P. Druschel
UNIX applications not wishing to block when doing I/O often use the select() system call, to wait for events on multiple file descriptors. The select() mechanism works well for small-scale applications, but scales poorly as the number of file descriptors increases. Many modern applications, such as Internet servers, use hundreds or thousands of file descriptors, and suffer greatly from the poor scalability of select(). Previous work has shown that while the traditional implementation of select() can be improved, the poor scalability is inherent in the design. We present a new event-delivery mechanism, which allows the application to register interest in one or more sources of events, and to efficiently dequeue new events. We show that this mechanism, which requires only minor changes to applications, performs independently of the number of file descriptors.
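The register-interest-then-dequeue model the paper argues for is what modern kernels ended up exposing (epoll, kqueue). A small sketch using Python's selectors module, which picks such a mechanism where available: interest is declared once and remembered by the kernel, instead of select() rescanning the whole descriptor set on every call:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll/kqueue where available
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Register interest in b once; no per-call descriptor set to rebuild.
sel.register(b, selectors.EVENT_READ, data="peer")

a.send(b"hello")
events = sel.select(timeout=1)     # dequeue ready events
msg = events[0][0].fileobj.recv(16)
print(msg)  # b'hello'

sel.unregister(b)
a.close(); b.close(); sel.close()
```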
2. A design framework for highly concurrent systems
Matt Welsh, Steven D. Gribble, Eric A. Brewer, David Culler
Building highly concurrent systems, such as large-scale Internet services, requires managing many information flows at once and maintaining peak throughput when demand exceeds resource availability. In addition, any platform supporting Internet services must provide high availability and be able to cope with burstiness of load. Many approaches to building concurrent systems have been proposed, which generally fall into the two categories of threaded and event-driven programming. We propose that threads and events are actually on the ends of a design spectrum, and that the best implementation strategy for these applications is somewhere in between. We present a general-purpose design framework for building highly concurrent systems, based on three design components — tasks, queues, and thread pools — which encapsulate the concurrency, performance, fault isolation, and software engineering benefits of both threads and events. We present a set of design patterns that can be applied to map an application onto an implementation using these components. In addition, we provide an analysis of several systems (including an Internet services platform and a highly available, distributed, persistent data store) constructed using our framework, demonstrating its benefit for building and reasoning about concurrent applications.
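The three components the abstract names (tasks, queues, and thread pools) compose into a stage: a queue of tasks drained by a small pool of threads, which decouples request arrival rate from processing concurrency. A minimal sketch of one such stage, with squaring standing in for real work:

```python
import queue
import threading

task_queue = queue.Queue()   # the stage's incoming event queue
results = queue.Queue()

def worker():
    # Pool thread: pull tasks off the queue until told to stop.
    while True:
        item = task_queue.get()
        if item is None:
            task_queue.task_done()
            break
        results.put(item * item)  # the "task": one small unit of work
        task_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()
for n in range(10):
    task_queue.put(n)
for _ in pool:
    task_queue.put(None)      # one sentinel per worker
for t in pool:
    t.join()

print(sorted(results.queue))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The queue between stages is also where backpressure lives: bounding it (queue.Queue(maxsize=...)) makes overload visible at the stage boundary instead of exhausting threads.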
3. Comparing and evaluating epoll, select, and poll event mechanisms
Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag
This paper uses a high-performance, event-driven, HTTP server (the µserver) to compare the performance of the select, poll, and epoll event mechanisms. We subject the µserver to a variety of workloads that allow us to expose the relative strengths and weaknesses of each event mechanism.
Interestingly, initial results show that the select and poll event mechanisms perform comparably to the epoll event mechanism in the absence of idle connections. Profiling data shows a significant amount of time spent executing a large number of epoll_ctl system calls. As a result, we examine a variety of techniques for reducing epoll_ctl overhead including edge-triggered notification, and introducing a new system call (epoll_ctlv) that aggregates several epoll_ctl calls into a single call. Our experiments indicate that although these techniques are successful at reducing epoll_ctl overhead, they only improve performance slightly.
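The edge-triggered notification the paper examines changes the contract: the application is notified once per readiness transition and must drain the descriptor completely, rather than being re-notified while data remains. A Linux-only sketch via Python's select.epoll (the byte counts are arbitrary):

```python
import select
import socket

ep = select.epoll()
a, b = socket.socketpair()
b.setblocking(False)

# EPOLLET requests edge-triggered delivery: one notification per
# readiness transition, so the reader must drain until EAGAIN.
ep.register(b.fileno(), select.EPOLLIN | select.EPOLLET)

a.send(b"x" * 8192)
events = ep.poll(timeout=1)

data = bytearray()
for fd, mask in events:
    while True:            # drain completely, as edge-triggering requires
        try:
            data += b.recv(4096)
        except BlockingIOError:
            break

print(len(data))  # 8192
ep.close(); a.close(); b.close()
```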
4. Asynchronous I/O support in Linux 2.5
Suparna Bhattacharya, Steven Pratt, Badari Pulavarty, Janet Morgan
This paper describes the Asynchronous I/O (AIO) support in the Linux 2.5 kernel, additional functionality available as patchsets, and plans for further improvements. More specifically, the following topics are treated in some depth:
• Asynchronous filesystem I/O
• Asynchronous direct I/O
• Asynchronous vector I/O
As of Linux 2.5, AIO falls into the common mainline path underlying all I/O operations, whether synchronous or asynchronous. The implications of this, and other significant ways in which the design for AIO in 2.5 differs from the patches that existed for 2.4, are explored as part of the discussion.
5. Linux AIO performance and robustness for enterprise workloads
Suparna Bhattacharya, John Tran, Mike Sullivan, Chris Mason
In this paper we address some of the issues identified during the development and stabilization of Asynchronous I/O (AIO) on Linux 2.6.
We start by describing improvements made to optimize the throughput of streaming buffered filesystem AIO for microbenchmark runs. Next, we discuss certain tricky issues in ensuring data integrity between AIO Direct I/O (DIO) and buffered I/O, and take a deeper look at synchronized I/O guarantees, concurrent I/O, write-ordering issues and the improvements resulting from radix-tree based writeback changes in the Linux VFS.
We then investigate the results of using Linux 2.6 filesystem AIO on the performance metrics for certain enterprise database workloads which are expected to benefit from AIO, and mention a few tips on optimizing AIO for such workloads. Finally, we briefly discuss the issues around workloads that need to combine asynchronous disk I/O and network I/O.