Design and Implementation of the Second Extended Filesystem

来源:百度文库 编辑:神马文学网 时间:2024/04/27 20:45:44

This paper was first published in the Proceedings of the First DutchInternational Symposium on Linux, ISBN 90-367-0385-9.


Design and Implementation of the Second Extended Filesystem

R閙y Card, Laboratoire MASI--Institut Blaise Pascal,E-Mail: card@masi.ibp.fr, and
Theodore Ts'o, Massachussets Institute of Technology,E-Mail: tytso@mit.edu, and
Stephen Tweedie, University of Edinburgh,E-Mail: sct@dcs.ed.ac.uk

Introduction

Linux is a Unix-like operating system, which runs on PC-386computers. It was implemented first as extension to the Minixoperating system [Tanenbaum 1987] and itsfirst versions included support for the Minix filesystem only.The Minix filesystem contains two serious limitations: blockaddresses are stored in 16 bit integers, thus the maximalfilesystem size is restricted to 64 mega bytes, and directoriescontain fixed-size entries and the maximal file name is 14characters.

We have designed and implemented two new filesystems that areincluded in the standard Linux kernel. These filesystems,called ``Extended File System'' (Ext fs) and ``Second ExtendedFile System'' (Ext2 fs) raise the limitations and add newfeatures.

In this paper, we describe the history of Linux filesystems. Webriefly introduce the fundamental concepts implemented in Unixfilesystems. We present the implementation of the Virtual FileSystem layer in Linux and we detail the Second Extended FileSystem kernel code and user mode tools. Last, we presentperformance measurements made on Linux and BSD filesystems andwe conclude with the current status of Ext2fs and the futuredirections.

History of Linux filesystems

In its very early days, Linux was cross-developed under theMinix operating system. It was easier to share disks betweenthe two systems than to design a new filesystem, so LinusTorvalds decided to implement support for the Minix filesystemin Linux. The Minix filesystem was an efficient and relativelybug-free piece of software.

However, the restrictions in the design of the Minixfilesystem were too limiting, so people started thinking andworking on the implementation of new filesystems in Linux.

In order to ease the addition of new filesystems into theLinux kernel, a Virtual File System (VFS) layer was developed.The VFS layer was initially written by Chris Provenzano, andlater rewritten by Linus Torvalds before it was integrated intothe Linux kernel. It is described in The Virtual File System.

After the integration of the VFS in the kernel, a newfilesystem, called the ``Extended File System'' was implementedin April 1992 and added to Linux 0.96c. This new filesystemremoved the two big Minix limitations: its maximal size was 2giga bytes and the maximal file name size was 255 characters.It was an improvement over the Minix filesystem but someproblems were still present in it. There was no support for theseparate access, inode modification, and data modificationtimestamps. The filesystem used linked lists to keep track offree blocks and inodes and this produced bad performances: asthe filesystem was used, the lists became unsorted and thefilesystem became fragmented.

As a response to these problems, two new filesytems werereleased in Alpha version in January 1993: the Xia filesystemand the Second Extended File System. The Xia filesystem washeavily based on the Minix filesystem kernel code and onlyadded a few improvements over this filesystem. Basically, itprovided long file names, support for bigger partitions andsupport for the three timestamps. On the other hand, Ext2fs wasbased on the Extfs code with many reorganizations and manyimprovements. It had been designed with evolution in mind andcontained space for future improvements. It will be describedwith more details in The SecondExtended File System

When the two new filesystems were first released, theyprovided essentially the same features. Due to its minimaldesign, Xia fs was more stable than Ext2fs. As the filesystemswere used more widely, bugs were fixed in Ext2fs and lots ofimprovements and new features were integrated. Ext2fs is nowvery stable and has become the de-facto standard Linuxfilesystem.

This table contains a summary of the featuresprovided by the different filesystems: Minix FSExt FSExt2 FSXia FS Max FS size 64 MB 2 GB 4 TB 2 GB Max file size 64 MB 2 GB 2 GB 64 MB Max file name 16/30 c 255 c 255 c 248 c 3 times support No No Yes Yes Extensible No No Yes No Var. block size No No Yes No Maintained Yes No Yes ?

Basic File System Concepts

Every Linux filesystem implements a basic set of commonconcepts derivated from the Unix operating system[Bach 1986] files are represented by inodes,directories are simply files containing a list of entries anddevices can be accessed by requesting I/O on special files.

Inodes

Each file is represented by a structure, called an inode.Each inode contains the description of the file: file type,access rights, owners, timestamps, size, pointers to datablocks. The addresses of data blocks allocated to a file arestored in its inode. When a user requests an I/O operation onthe file, the kernel code converts the current offset to ablock number, uses this number as an index in the blockaddresses table and reads or writes the physical block. Thisfigure represents the structure of an inode:

Directories

Directories are structured in a hierarchical tree. Eachdirectory can contain files and subdirectories.

Directories are implemented as a special type of files.Actually, a directory is a file containing a list of entries.Each entry contains an inode number and a file name. When aprocess uses a pathname, the kernel code searchs in thedirectories to find the corresponding inode number. After thename has been converted to an inode number, the inode is loadedinto memory and is used by subsequent requests.

This figure represents a directory:

Links

Unix filesystems implement the concept of link. Severalnames can be associated with a inode. The inode contains afield containing the number associated with the file. Adding alink simply consists in creating a directory entry, where theinode number points to the inode, and in incrementing the linkscount in the inode. When a link is deleted, i.e. when one usesthe rm command to remove a filename, the kerneldecrements the links count and deallocates the inode if thiscount becomes zero.

This type of link is called a hard link and can only be usedwithin a single filesystem: it is impossible to createcross-filesystem hard links. Moreover, hard links can onlypoint on files: a directory hard link cannot be created toprevent the apparition of a cycle in the directory tree.

Another kind of links exists in most Unix filesystems.Symbolic links are simply files which contain a filename. Whenthe kernel encounters a symbolic link during a pathname toinode conversion, it replaces the name of the link by itscontents, i.e. the name of the target file, and restarts thepathname interpretation. Since a symbolic link does not pointto an inode, it is possible to create cross-filesystemssymbolic links. Symbolic links can point to any type of file,even on nonexistent files. Symbolic links are very usefulbecause they don't have the limitations associated to hardlinks. However, they use some disk space, allocated for theirinode and their data blocks, and cause an overhead in thepathname to inode conversion because the kernel has to restartthe name interpretation when it encounters a symbolic link.

Device special files

In Unix-like operating systems, devices can be accessed viaspecial files. A device special file does not use any space onthe filesystem. It is only an access point to the devicedriver.

Two types of special files exist: character and blockspecial files. The former allows I/O operations in charactermode while the later requires data to be written in block modevia the buffer cache functions. When an I/O request is made ona special file, it is forwarded to a (pseudo) device driver. Aspecial file is referenced by a major number, which identifiesthe device type, and a minor number, which identifies the unit.

The Virtual File System

Principle

The Linux kernel contains a Virtual File System layer whichis used during system calls acting on files. The VFS is anindirection layer which handles the file oriented system callsand calls the necessary functions in the physical filesystemcode to do the I/O.

This indirection mechanism is frequently used in Unix-likeoperating systems to ease the integration and the use ofseveral filesystem types [Kleiman 1986,Seltzer et al. 1993].

When a process issues a file oriented system call, thekernel calls a function contained in the VFS. This functionhandles the structure independent manipulations and redirectsthe call to a function contained in the physical filesystemcode, which is responsible for handling the structure dependentoperations. Filesystem code uses the buffer cache functions torequest I/O on devices. This scheme is illustrated in thisfigure:

The VFS structure

The VFS defines a set of functions that every filesystem hasto implement. This interface is made up of a set of operationsassociated to three kinds of objects: filesystems, inodes, andopen files.

The VFS knows about filesystem types supported in thekernel. It uses a table defined during the kernelconfiguration. Each entry in this table describes a filesystemtype: it contains the name of the filesystem type and a pointeron a function called during the mount operation. When afilesystem is to be mounted, the appropriate mount function iscalled. This function is responsible for reading the superblockfrom the disk, initializing its internal variables, andreturning a mounted filesystem descriptor to the VFS. After thefilesystem is mounted, the VFS functions can use thisdescriptor to access the physical filesystem routines.

A mounted filesystem descriptor contains several kinds ofdata: informations that are common to every filesystem types,pointers to functions provided by the physical filesystemkernel code, and private data maintained by the physicalfilesystem code. The function pointers contained in thefilesystem descriptors allow the VFS to access the filesysteminternal routines.

Two other types of descriptors are used by the VFS: an inode descriptorand an open file descriptor. Each descriptor contains informations related tofiles in use and a set of operations provided by the physical filesystem code.While the inode descriptor contains pointers to functions that can be used toact on any file (e.g. create, unlink), the file descriptorscontains pointer to functions which can only act on open files (e.g.read, write).

The Second Extended File System

Motivations

The Second Extended File System has been designed andimplemented to fix some problems present in the first ExtendedFile System. Our goal was to provide a powerful filesystem,which implements Unix file semantics and offers advancedfeatures.

Of course, we wanted to Ext2fs to have excellentperformance. We also wanted to provide a very robustfilesystem in order to reduce the risk of data loss inintensive use. Last, but not least, Ext2fs had to includeprovision for extensions to allow users to benefit from newfeatures without reformatting their filesystem.

``Standard'' Ext2fs features

The Ext2fs supports standard Unix file types: regular files,directories, device special files and symbolic links.

Ext2fs is able to manage filesystems created on really bigpartitions. While the original kernel code restricted themaximal filesystem size to 2 GB, recent work in the VFS layerhave raised this limit to 4 TB. Thus, it is now possible to usebig disks without the need of creating many partitions.

Ext2fs provides long file names. It uses variable lengthdirectory entries. The maximal file name size is 255characters. This limit could be extended to 1012 if needed.

Ext2fs reserves some blocks for the super user(root). Normally, 5% of the blocks are reserved. Thisallows the administrator to recover easily from situationswhere user processes fill up filesystems.

``Advanced'' Ext2fs features

In addition to the standard Unix features, Ext2fs supportssome extensions which are not usually present in Unixfilesystems.

File attributes allow the users to modify the kernelbehavior when acting on a set of files. One can set attributeson a file or on a directory. In the later case, new filescreated in the directory inherit these attributes.

BSD or System V Release 4 semantics can be selected at mounttime. A mount option allows the administrator to choose thefile creation semantics. On a filesystem mounted with BSDsemantics, files are created with the same group id as theirparent directory. System V semantics are a bit more complex: ifa directory has the setgid bit set, new files inherit the groupid of the directory and subdirectories inherit the group id andthe setgid bit; in the other case, files and subdirectories arecreated with the primary group id of the calling process.

BSD-like synchronous updates can be used in Ext2fs. A mountoption allows the administrator to request that metadata(inodes, bitmap blocks, indirect blocks and directory blocks)be written synchronously on the disk when they are modified.This can be useful to maintain a strict metadata consistencybut this leads to poor performances. Actually, this feature isnot normally used, since in addition to the performance lossassociated with using synchronous updates of the metadata, itcan cause corruption in the user data which will not be flaggedby the filesystem checker.

Ext2fs allows the administrator to choose the logical blocksize when creating the filesystem. Block sizes can typically be1024, 2048 and 4096 bytes. Using big block sizes can speed upI/O since fewer I/O requests, and thus fewer disk head seeks,need to be done to access a file. On the other hand, big blockswaste more disk space: on the average, the last block allocatedto a file is only half full, so as blocks get bigger, morespace is wasted in the last block of each file. In addition,most of the advantages of larger block sizes are obtained byExt2 filesystem's preallocation techniques (see sectionPerformance optimizations).

Ext2fs implements fast symbolic links. A fast symbolic linkdoes not use any data block on the filesystem. The target nameis not stored in a data block but in the inode itself. Thispolicy can save some disk space (no data block needs to beallocated) and speeds up link operations (there is no need toread a data block when accessing such a link). Of course, thespace available in the inode is limited so not every link canbe implemented as a fast symbolic link. The maximal size of thetarget name in a fast symbolic link is 60 characters. We planto extend this scheme to small files in the near future.

Ext2fs keeps track of the filesystem state. A special fieldin the superblock is used by the kernel code to indicate thestatus of the file system. When a filesystem is mounted inread/write mode, its state is set to ``Not Clean''. When it isunmounted or remounted in read-only mode, its state is reset to``Clean''. At boot time, the filesystem checker uses thisinformation to decide if a filesystem must be checked. Thekernel code also records errors in this field. When aninconsistency is detected by the kernel code, the filesystem ismarked as ``Erroneous''. The filesystem checker tests this toforce the check of the filesystem regardless of its apparentlyclean state.

Always skipping filesystem checks may sometimes bedangerous, so Ext2fs provides two ways to force checks atregular intervals. A mount counter is maintained in thesuperblock. Each time the filesystem is mounted in read/writemode, this counter is incremented. When it reaches a maximalvalue (also recorded in the superblock), the filesystem checkerforces the check even if the filesystem is ``Clean''. A lastcheck time and a maximal check interval are also maintained inthe superblock. These two fields allow the administrator torequest periodical checks. When the maximal check interval hasbeen reached, the checker ignores the filesystem state andforces a filesystem check.Ext2fs offers tools to tune the filesystem behavior.The tune2fs program can be used to modify:

  • the error behavior. When an inconsistency is detected by the kernel code, the filesystem is marked as ``Erroneous'' and one of the three following actions can be done: continue normal execution, remount the filesystem in read-only mode to avoid corrupting the filesystem, make the kernel panic and reboot to run the filesystem checker.
  • the maximal mount count.
  • the maximal check interval.
  • the number of logical blocks reserved for the super user.

Mount options can also be used to change the kernel error behavior.

An attribute allows the users to request secure deletion onfiles. When such a file is deleted, random data is written inthe disk blocks previously allocated to the file. This preventsmalicious people from gaining access to the previous content ofthe file by using a disk editor.

Last, new types of files inspired from the 4.4 BSDfilesystem have recently been added to Ext2fs. Immutable filescan only be read: nobody can write or delete them. This can beused to protect sensitive configuration files. Append-onlyfiles can be opened in write mode but data is always appendedat the end of the file. Like immutable files, they cannot bedeleted or renamed. This is especially useful for log fileswhich can only grow.

Physical Structure

The physical structure of Ext2 filesystems has been stronglyinfluenced by the layout of the BSD filesystem[McKusick et al. 1984]. Afilesystem is made up of block groups. Block groups areanalogous to BSD FFS's cylinder groups. However, block groupsare not tied to the physical layout of the blocks on the disk,since modern drives tend to be optimized for sequential accessand hide their physical geometry to the operating system.

The physical structure of a filesystem is represented in thistable: Boot
Sector Block
Group 1 Block
Group 2 ...
... Block
Group N

Each block group contains a redundant copy of crucial filesystemcontrol informations (superblock and the filesystem descriptors) andalso contains a part of the filesystem (a block bitmap, an inodebitmap, a piece of the inode table, and data blocks). The structure ofa block group is represented in this table: Super
Block FS
descriptors Block
Bitmap Inode
Bitmap Inode
Table Data
Blocks

Using block groups is a big win in terms of reliability:since the control structures are replicated in each blockgroup, it is easy to recover from a filesystem where thesuperblock has been corrupted. This structure also helps to getgood performances: by reducing the distance between the inodetable and the data blocks, it is possible to reduce the diskhead seeks during I/O on files.

In Ext2fs, directories are managed as linked lists ofvariable length entries. Each entry contains the inode number,the entry length, the file name and its length. By usingvariable length entries, it is possible to implement long filenames without wasting disk space in directories. The structureof a directory entry is shown in this table: inode number entry length name length filename

As an example, The next table represents the structure of adirectory containing three files: file1,long_file_name, and f2: i1 16 05 file1 i2 40 14 long_file_name i3 12 02 f2

Performance optimizations

The Ext2fs kernel code contains many performanceoptimizations, which tend to improve I/O speed when reading andwriting files.

Ext2fs takes advantage of the buffer cache management byperforming readaheads: when a block has to be read, the kernelcode requests the I/O on several contiguous blocks. This way,it tries to ensure that the next block to read will already beloaded into the buffer cache. Readaheads are normally performedduring sequential reads on files and Ext2fs extends them todirectory reads, either explicit reads (readdir(2)calls) or implicit ones (namei kernel directorylookup).

Ext2fs also contains many allocation optimizations. Blockgroups are used to cluster together related inodes and data:the kernel code always tries to allocate data blocks for a filein the same group as its inode. This is intended to reduce thedisk head seeks made when the kernel reads an inode and itsdata blocks.

When writing data to a file, Ext2fs preallocates up to 8adjacent blocks when allocating a new block. Preallocation hitrates are around 75% even on very full filesystems. Thispreallocation achieves good write performances under heavyload. It also allows contiguous blocks to be allocated tofiles, thus it speeds up the future sequential reads.

These two allocation optimizations produce a very good locality of:

  • related files through block groups
  • related blocks through the 8 bits clustering of block allocations.

The Ext2fs library

To allow user mode programs to manipulate the controlstructures of an Ext2 filesystem, the libext2fs library wasdeveloped. This library provides routines which can be used toexamine and modify the data of an Ext2 filesystem, by accessingthe filesystem directly through the physical device.

The Ext2fs library was designed to allow maximal code reusethrough the use of software abstraction techniques. Forexample, several different iterators are provided. A programcan simply pass in a function toext2fs_block_interate(), which will be called for eachblock in an inode. Another iterator function allows anuser-provided function to be called for each file in adirectory.

Many of the Ext2fs utilities (mke2fs,e2fsck, tune2fs, dumpe2fs, anddebugfs) use the Ext2fs library. This greatlysimplifies the maintainance of these utilities, since anychanges to reflect new features in the Ext2 filesystem formatneed only be made in one place--in the Ext2fs library. Thiscode reuse also results in smaller binaries, since the Ext2fslibrary can be built as a shared library image.

Because the interfaces of the Ext2fs library are so abstractand general, new programs which require direct access to theExt2fs filesystem can very easily be written. For example, theExt2fs library was used during the port of the 4.4BSD dump andrestore backup utilities. Very few changes were needed to adaptthese tools to Linux: only a few filesystem dependent functionshad to be replaced by calls to the Ext2fs library.

The Ext2fs library provides access to several classes ofoperations. The first class are the filesystem-orientedoperations. A program can open and close a filesystem, readand write the bitmaps, and create a new filesystem on the disk.Functions are also available to manipulate the filesystem's badblocks list.

The second class of operations affect directories. A callerof the Ext2fs library can create and expand directories, aswell as add and remove directory entries. Functions are alsoprovided to both resolve a pathname to an inode number, and todetermine a pathname of an inode given its inode number.

The final class of operations are oriented around inodes.It is possible to scan the inode table, read and write inodes,and scan through all of the blocks in an inode. Allocation anddeallocation routines are also available and allow user modeprograms to allocate and free blocks and inodes.

The Ext2fs tools

Powerful management tools have been developed for Ext2fs.These utilities are used to create, modify, and correct anyinconsistencies in Ext2 filesystems. The mke2fsprogram is used to initialize a partition to contain an emptyExt2 filesystem.

The tune2fs program can be used to modify the filesystemparameters. As explained in section ``Advanced'' Ext2fs features, it can change the errorbehavior, the maximal mount count, the maximal check interval,and the number of logical blocks reserved for the super user.

The most interesting tool is probably the filesystemchecker. E2fsck is intended to repair filesysteminconsistencies after an unclean shutdown of the system. Theoriginal version of e2fsck was based on LinusTorvald's fsck program for the Minix filesystem. However, thecurrent version of e2fsck was rewritten from scratch,using the Ext2fs library, and is much faster and can correctmore filesystem inconsistencies than the original version.

The e2fsck program is designed to run as quickly aspossible. Since filesystem checkers tend to be disk bound,this was done by optimizing the algorithms used bye2fsck so that filesystem structures are notrepeatedly accessed from the disk. In addition, the order inwhich inodes and directories are checked are sorted by blocknumber to reduce the amount of time in disk seeks. Many ofthese ideas were originally explored by[Bina and Emrath 1989] although they havesince been further refined by the authors.

In pass 1, e2fsck iterates over all of the inodesin the filesystem and performs checks over each inode as anunconnected object in the filesystem. That is, these checks donot require any cross-checks to other filesystem objects.Examples of such checks include making sure the file mode islegal, and that all of the blocks in the inode are valid blocknumbers. During pass 1, bitmaps indicating which blocks andinodes are in use are compiled.

If e2fsck notices data blocks which are claimed bymore than one inode, it invokes passes 1B through 1D to resolvethese conflicts, either by cloning the shared blocks so thateach inode has its own copy of the shared block, or bydeallocating one or more of the inodes.

Pass 1 takes the longest time to execute, since all of theinodes have to be read into memory and checked. To reduce theI/O time necessary in future passes, critical filesysteminformation is cached in memory. The most important example ofthis technique is the location on disk of all of the directoryblocks on the filesystem. This obviates the need to re-readthe directory inodes structures during pass 2 to obtain thisinformation.

Pass 2 checks directories as unconnected objects. Sincedirectory entries do not span disk blocks, each directory blockcan be checked individually without reference to otherdirectory blocks. This allows e2fsck to sort all ofthe directory blocks by block number, and check directoryblocks in ascending order, thus decreasing disk seek time. Thedirectory blocks are checked to make sure that the directoryentries are valid, and contain references to inode numberswhich are in use (as determined by pass 1).

For the first directory block in each directory inode, the`.' and `..' entries are checked to make sure they exist, andthat the inode number for the `.' entry matches the currentdirectory. (The inode number for the `..' entry is not checkeduntil pass 3.)

Pass 2 also caches information concerning the parentdirectory in which each directory is linked. (If a directoryis referenced by more than one directory, the second referenceof the directory is treated as an illegal hard link, and it isremoved).

It is noteworthy to note that at the end of pass 2, nearlyall of the disk I/O which e2fsck needs to perform iscomplete. Information required by passes 3, 4 and 5 are cachedin memory; hence, the remaining passes of e2fsck arelargely CPU bound, and take less than 5-10% of the totalrunning time of e2fsck.

In pass 3, the directory connectivity is checked.E2fsck traces the path of each directory back to theroot, using information that was cached during pass 2. At thistime, the `..' entry for each directory is also checked to makesure it is valid. Any directories which can not be traced backto the root are linked to the /lost+found directory.

In pass 4, e2fsck checks the reference counts forall inodes, by iterating over all the inodes and comparing thelink counts (which were cached in pass 1) against internalcounters computed during passes 2 and 3. Any undeleted fileswith a zero link count is also linked to the/lost+found directory during this pass.

Finally, in pass 5, e2fsck checks the validity ofthe filesystem summary information. It compares the block andinode bitmaps which were constructed during the previous passesagainst the actual bitmaps on the filesystem, and corrects theon-disk copies if necessary.

The filesystem debugger is another useful tool.Debugfs is a powerful program which can be used toexamine and change the state of a filesystem. Basically, itprovides an interactive interface to the Ext2fs library:commands typed by the user are translated into calls to thelibrary routines.

Debugfs can be used to examine the internalstructures of a filesystem, manually repair a corruptedfilesystem, or create test cases for e2fsck.Unfortunately, this program can be dangerous if it is used bypeople who do not know what they are doing; it is very easy todestroy a filesystem with this tool. For this reason,debugfs opens filesytems for read-only access bydefault. The user must explicitly specify the -w flagin order to use debugfs to open a filesystem forread/wite access.

Performance Measurements

Description of the benchmarks

We have run benchmarks to measure filesystem performances.Benchmarks have been made on a middle-end PC, based on ai486DX2 processor, using 16 MB of memory and two 420 MB IDEdisks. The tests were run on Ext2 fs and Xia fs (Linux 1.1.62)and on the BSD Fast filesystem in asynchronous and synchronousmode (FreeBSD 2.0 Alpha--based on the 4.4BSD Litedistribution).

We have run two different benchmarks. The Bonnie benchmarktests I/O speed on a big file--the file size was set to 60 MBduring the tests. It writes data to the file using characterbased I/O, rewrites the contents of the whole file, writes datausing block based I/O, reads the file using character I/O andblock I/O, and seeks into the file. The Andrew Benchmark wasdeveloped at Carneggie Mellon University and has been used atthe University of Berkeley to benchmark BSD FFS and LFS. Itruns in five phases: it creates a directory hierarchy, makes acopy of the data, recursively examine the status of every file,examine every byte of every file, and compile several of thefiles.

Results of the Bonnie benchmark

The results of the Bonnie benchmark are presented in thistable: Char Write
(KB/s) Block Write
(KB/s) Rewrite
(KB/s) Char Read
(KB/s) Block Read
(KB/s) BSD Async 710 684 401 721 888 BSD Sync 699 677 400 710 878 Ext2 fs 452 1237 536 397 1033 Xia fs 440 704 380 366 895

The results are very good in block oriented I/O: Ext2 fsoutperforms other filesystems. This is clearly a benefit of theoptimizations included in the allocation routines. Writes arefast because data is written in cluster mode. Reads are fastbecause contiguous blocks have been allocated to the file. Thusthere is no head seek between two reads and the readaheadoptimizations can be fully used.

On the other hand, performance is better in the FreeBSDoperating system in character oriented I/O. This is probablydue to the fact that FreeBSD and Linux do not use the samestdio routines in their respective C libraries. It seems thatFreeBSD has a more optimized character I/O library and itsperformance is better.

Results of the Andrew benchmark

The results of the Andrew benchmark are presented inthis table: P1 Create
(ms) P2 Copy
(ms) P3 Stat
(ms) P4 Grep
(ms) P5 Compile
(ms) BSD Async 2203 7391 6319 17466 75314 BSD Sync 2330 7732 6317 17499 75681 Ext2 fs 790 4791 7235 11685 63210 Xia fs 934 5402 8400 12912 66997

The results of the two first passes show that Linux benefitsfrom its asynchronous metadata I/O. In passes 1 and 2,directories and files are created and BSD synchronously writesinodes and directory entries. There is an anomaly, though: evenin asynchronous mode, the performance under BSD is poor. Wesuspect that the asynchronous support under FreeBSD is notfully implemented.

In pass 3, the Linux and BSD times are very similar. This isa big progress against the same benchmark run six months ago.While BSD used to outperform Linux by a factor of 3 in thistest, the addition of a file name cache in the VFS has fixedthis performance problem.

In passes 4 and 5, Linux is faster than FreeBSD mainlybecause it uses an unified buffer cache management. The buffercache space can grow when needed and use more memory than theone in FreeBSD, which uses a fixed size buffer cache.Comparison of the Ext2fs and Xiafs results shows that theoptimizations included in Ext2fs are really useful: theperformance gain between Ext2fs and Xiafs is around 5-10%.

Conclusion

The Second Extended File System is probably the most widelyused filesystem in the Linux community. It provides standardUnix file semantics and advanced features. Moreover, thanks tothe optimizations included in the kernel code, it is robust andoffers excellent performance.

Since Ext2fs has been designed with evolution in mind, itcontains hooks that can be used to add new features. Somepeople are working on extensions to the current filesystem:access control lists conforming to the Posix semantics[IEEE 1992], undelete, and on-the-flyfile compression.

Ext2fs was first developed and integrated in the Linuxkernel and is now actively being ported to other operatingsystems. An Ext2fs server running on top of the GNU Hurd hasbeen implemented. People are also working on an Ext2fs port inthe LITES server, running on top of the Mach microkernel[Accetta et al. 1986], andin the VSTa operating system. Last, but not least, Ext2fs is animportant part of the Masix operating system[Card et al. 1993],currently under development by one of the authors.

Acknowledgments

The Ext2fs kernel code and tools have been written mostly bythe authors of this paper. Some other people have alsocontributed to the development of Ext2fs either by suggestingnew features or by sending patches. We want to thank thesecontributors for their help.

References

[Accetta et al. 1986]M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, andM. Young.Mach: A New Kernel Foundation For UNIX Development.In Proceedings of the USENIX 1986 Summer Conference, June 1986.

[Bach 1986]M. Bach.The Design of the UNIX Operating System.Prentice Hall, 1986.

[Bina and Emrath 1989]E. Bina and P. Emrath.A Faster fsck for BSD Unix.In Proceedings of the USENIX Winter Conference, January 1989.

[Card et al. 1993]R. Card, E. Commelin, S. Dayras, and F. M関el.The MASIX Multi-Server Operating System.In OSF Workshop on Microkernel Technology for Distributed Systems,June 1993.

[IEEE 1992]SECURITY INTERFACE for the Portable Operating System Interface forComputer Environments - Draft 13.Institute of Electrical and Electronics Engineers, Inc, 1992.

[Kleiman 1986]S. Kleiman.Vnodes: An Architecture for Multiple File System Typesin Sun UNIX.In Proceedings of the Summer USENIX Conference, pages 260--269,June 1986.

[McKusick et al. 1984]M. McKusick, W. Joy, S. Leffler, and R. Fabry.A Fast File System for UNIX.ACM Transactions on Computer Systems, 2(3):181--197, August1984.

[Seltzer et al. 1993]M. Seltzer, K. Bostic, M. McKusick, and C. Staelin.An Implementation of a Log-Structured File System forUNIX.In Proceedings of the USENIX Winter Conference, January 1993.

[Tanenbaum 1987]A. Tanenbaum.Operating Systems: Design and Implementation.Prentice Hall, 1987.


Thanks to Michael Johnson for HTMLizing it (originally for use inthe KernelHacker's Guide).