hash的工作原理

来源:百度文库 编辑:神马文学网 时间:2024/04/23 19:56:22
from Data Structures & Algorithms in Java chapter 11 Hash tables (这本书 www.infoxa.com 可以下)
In this section we‘ll introduce hash tables and hashing. One important concept is how a range of key values is transformed into a range of array index values. In a hash table this is accomplished with a hash function. However, for certain kinds of keys, no hash function is necessary; the key values can be used directly as array indices. We‘ll look at this simpler situation first and then go on to show how hash functions can be used when keys aren‘t distributed in such an orderly fashion.
Employee Numbers as Keys
Suppose you‘re writing a program to access employee records for a small company with, say, 1,000 employees. Each employee record requires 1,000 bytes of storage. Thus you can store the entire database in only 1 megabyte, which will easily fit in your computer‘s memory
The company‘s personnel director has specified that she wants the fastest possible access to any individual record. Also, every employee has been given a number from 1 (for the founder) to 1,000 (for the most recently hired worker). These employee numbers can be used as keys to access the records; in fact, access by other keys is deemed unnecessary. Employees are seldom laid off, but even when they are, their record remains in the database for reference (concerning retirement benefits and so on). What sort of data structure should you use in this situation?
Keys Are Index Numbers
One possibility is a simple array. Each employee record occupies one cell of the array,and the index number of the cell is the employee number for that record.
As you know, accessing a specified array element is very fast if you know its index number. The clerk looking up Herman Alcazar knows that he is employee number 72, so he enters that number, and the program goes instantly to index number 72 in the array. A single program statement is all that‘s necessary:
1
empRecord rec = databaseArray[72];
It‘s also very quick to add a new item: You insert it just past the last occupied element. The next new record?for Jim Chan, the newly hired employee number 1,001?would go in cell 1,001. Again, a single statement inserts the new record:
1
databaseArray[totalEmployees++] = newRecord;
Presumably the array is made somewhat larger than the current number of employees, to allow room for expansion; but not much expansion is anticipated.
Not Always So Orderly
The speed and simplicity of data access using this array-based database make it very attractive. However, it works in our example only because the keys are unusually well organized. They run sequentially from 1 to a known maximum, and this maximum is a reasonable size for an array. There are no deletions, so memory-wasting gaps don‘t develop in the sequence. New items can be added sequentially at the end of the array, and the array doesn‘t need to be very much larger than the current number of items.
A Dictionary
In many situations the keys are not so well behaved as in the employee database just described. The classic example is a dictionary. If you want to put every word of an English-language dictionary, from a to zyzzyva (yes, it‘s a word), into your computer‘s memory, so they can be accessed quickly, a hash table is a good choice.
A similar widely used application for hash tables is in computer-language compilers, which maintain a symbol table in a hash table. The symbol table holds all the variable and function names made up by the programmer, along with the addresses where they can be found in memory. The program needs to access these names very quickly, so a hash table is the preferred data structure.
Let‘s say we want to store a 50,000-word English-language dictionary in main memory. You would like every word to occupy its own cell in a 50,000-cell array, so you can access the word using an index number. This will make access very fast. But what‘s the relationship of these index numbers to the words? Given the word morphosis, for example, how do we find its index number?
Converting Words to Numbers
What we need is a system for turning a word into an appropriate index number. To begin,we know that computers use various schemes for representing individual characters as numbers. One such scheme is the ASCII code, in which a is 97, b is 98, and so on, up to 122 for z.
However, the ASCII code runs from 0 to 255, to accommodate capitals, punctuation, and so on. There are really only 26 letters in English words, so let‘s devise our own code?a simpler one that can potentially save memory space. Let‘s say a is 1, b is 2, c is 3, and so on up to 26 for z. We‘ll also say a blank is 0, so we have 27 characters. (Uppercase letters aren‘t used in this dictionary.)
How do we combine the digits from individual letters into a number that represents an entire word? There are all sorts of approaches. We‘ll look at two representative ones, and their advantages and disadvantages.
Add the Digits
A simple approach to converting a word to a number might be to simply add the code numbers for each character. Say we want to convert the word cats to a number. First we convert the characters to digits using our homemade code:
1
2
3
4
c = 3 a = 1 t = 20 s = 19
Then we add them:
1
3 + 1 + 20 + 19 = 43
Thus in our dictionary the word cats would be stored in the array cell with index 43. All the other English words would likewise be assigned an array index calculated by this process.
How well would this work? For the sake of argument, let‘s restrict ourselves to 10-letter words. Then (remembering that a blank is 0), the first word in the dictionary, a, would be coded by
1
0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1
The last potential word in the dictionary would be zzzzzzzzzz (ten Zs). Our code obtained by adding its letters would be
1
26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 = 260
Thus the total range of word codes is from 1 to 260. Unfortunately, there are 50,000 words in the dictionary, so there aren‘t enough index numbers to go around. Each array element will need to hold about 192 words (50,000 divided by 260).
Clearly this presents problems if we‘re thinking in terms of our one word-per-array element scheme. Maybe we could put a subarray or linked list of words at each array element. However, this would seriously degrade the access speed. It would be quick to access the array element, but slow to search through the 192 words to find the one we wanted.
So our first attempt at converting words to numbers leaves something to be desired. Too many words have the same index. (For example, was, tin, give, tend, moan, tick, bails, dredge, and hundreds of other words add to 43, as cats does.) We conclude that this approach doesn‘t discriminate enough, so the resulting array has too few elements. We need to spread out the range of possible indices.
Multiply by Powers
Let‘s try a different way to map words to numbers. If our array was too small before, let‘s make sure it‘s big enough. What would happen if we created an array in which every word, in fact every potential word, from a to zzzzzzzzzz, was guaranteed to occupy its own unique array element?
To do this, we need to be sure that every character in a word contributes in a unique way to the final number.
We‘ll begin by thinking about an analogous situation with numbers instead of words. Recall that in an ordinary multi-digit number, each digit position represents a value 10 times as big as the position to its right. Thus 7,546 really means
1
7*1000 + 5*100 + 4*10 + 6*1
Or, writing the multipliers as powers of 10:
1
7*103 + 5*102 + 4*101 + 6**100
(An input routine in a computer program performs a similar series of multiplications and additions to convert a sequence of digits, entered at the keyboard, into a number stored in memory.)
In this system we break a number into its digits, multiply them by appropriate powers of 10 (because there are 10 possible digits), and add the products.
In a similar way we can decompose a word into its letters, convert the letters to their numerical equivalents, multiply them by appropriate powers of 27 (because there are 27 possible characters, including the blank), and add the results. This gives a unique number for every word.
Say we want to convert the word cats to a number. We convert the digits to numbers as shown earlier. Then we multiply each number by the appropriate power of 27, and add the results:
1
3*27^3 + 1*27^2 + 20*27^1 + 19*27^0
Calculating the powers gives
1
3*19,683 + 1*729 + 20*27 + 19*1
and multiplying the letter codes times the powers yields
1
59,049 + 729 + 540 + 19
which sums to 60,337.
This process does indeed generate a unique number for every potential word. We just calculated a four-letter word. What happens with larger words? Unfortunately the range of numbers becomes rather large. The largest 10-letter word, zzzzzzzzzz, translates into
1
26*27^9 + 26*27^8 + 26*27^7 + 26*27^6 + 26*27^5 + 26*27^4 + 26*27^3 +26*27^2 + 26*27^1 +26*27^0
Just by itself, 279 is more than 7,000,000,000,000, so you can see that the sum will be huge. An array stored in memory can‘t possibly have this many elements.
The problem is that this scheme assigns an array element to every potential word, whether it‘s an actual English word or not. Thus there are cells for aaaaaaaaaa, aaaaaaaaab, aaaaaaaaac, and so on, up to zzzzzzzzzz. Only a small fraction of these are necessary for real words, so most array cells are empty.
Our first scheme?adding the numbers?generated too few indices. This latest scheme?adding the numbers times powers of 27?generates too many.
Hashing
What we need is a way to compress the huge range of numbers we obtain from the numbers-multiplied-by-powers system into a range that matches a reasonably sized array.
How big an array are we talking about for our English dictionary? If we only have 50,000 words, you might assume our array should have approximately this many elements. However, it turns out we‘re going to need an array with about twice this many cells. (It will become clear later why this is so.) So we need an array with 100,000 elements.
Thus we look for a way to squeeze a range of 0 to more than 7,000,000,000,000 into the range 0 to 100,000. A simple approach is to use the modulo operator (%), which finds the remainder when one number is divided by another.
To see how this works, let‘s look at a smaller and more comprehensible range. Suppose we squeeze numbers in the range 0 to 199 (we‘ll represent them by the variable largeNumber) into the range 0 to 9 (the variable smallNumber). There are 10 numbers in the range of small numbers, so we‘ll say that a variable smallRange has the value 10.
It doesn‘t really matter what the large range is (unless it overflows the program‘s variable size). The Java expression for the conversion is
1
smallNumber = largeNumber % smallRange;
The remainders when any number is divided by 10 are always in the range 0 to 9; for example, 13%10 gives 3, and 157%10 is 7. We‘ve squeezed the range 0?199 into the range 0?9, a 20-to-1 compression ratio.
A similar expression can be used to compress the really huge numbers that uniquely represent every English word into index numbers that fit in our dictionary array:
1
arrayIndex = hugeNumber % arraySize;
This is an example of a hash function. It hashes (converts) a number in a large range into a number in a smaller range. This smaller range corresponds to the index numbers in an array. An array into which data is inserted using a hash function is called a hash table.
To review: We convert a word into a huge number by multiplying each character in the word by an appropriate power of 27.hugeNumber = ch0*27^9 + ch1*27^8 + ch2*27^7 +ch3*27^6 + ch4*27^5 + ch5*27^4 + ch6*27^3 + ch7*27^2 + ch8*27^1 + ch9*27^0
Then, using the modulo (%) operator, we squeeze the resulting huge range of numbers into a range about twice as big as the number of items we want to store. This is an example of a hash function:
1
2
arraySize = numberWords * 2; arrayIndex = hugeNumber % arraySize;
In the huge range, each number represents a potential data item (an arrangement of letters), but few of these numbers represent actual data items (English words). A hash function transforms these large numbers into the index numbers of a much smaller array. In this array we expect that, on the average, there will be one word for every two cells. Some cells will have no words, and some more than one.
A practical implementation of this scheme runs into trouble because hugeNumber will probably overflow its variable size, even for type long. We‘ll see how to deal with this later.