Hashing
Introduction to Hashing; Hash Function Design
We've seen that with a balanced binary search tree, we can perform insertions and searches in time O(log n).
Q: Can we do better than this?
Consider access operations on an array:
array[index] = item;
item = array[index];
The running time of these operations, which take advantage of the random access property of main memory, is O(1).
Could we apply this same technique to the implementation of a set or map?
Example: Store a set of employee records using an array, indexed by social security number:
class Employee {
    String name;
    String ssn;
    String title;
    ...
}
Employee[] employeeTable = new Employee[1000000000];
Example: Store a set of 10-character names in an array by treating each one as a base-27 number. Each character in the name is a base-27 digit; for example, let ' ' = 0, 'a' = 1, 'b' = 2, etc. So the name "jane", padded with spaces to 10 characters, is equivalent to the number
10*27^9 + 1*27^8 + 14*27^7 + 5*27^6 + 0*27^5 + 0*27^4 + 0*27^3 + 0*27^2 + 0*27^1 + 0*27^0
To store a name in the set, let
array[name.computeIndex()] = 1;
(A name is removed from the set by assigning a value of 0 to its position in the array.)
Q: How big is the array?
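To make the question concrete, here is a minimal sketch of the index computation described above. The class name NameIndex and the helpers digit and tableLength are hypothetical, and the sketch assumes names contain only lowercase letters and spaces.

```java
class NameIndex {
    static final int NAME_LENGTH = 10; // every name is padded with spaces to 10 characters
    static final int BASE = 27;        // ' ' = 0, 'a' = 1, ..., 'z' = 26

    // Converts one character to its base-27 digit.
    static int digit(char c) {
        if (c == ' ') return 0;
        if (c >= 'a' && c <= 'z') return c - 'a' + 1;
        throw new IllegalArgumentException("unsupported character: " + c);
    }

    // Reads the padded name as a 10-digit base-27 number, most significant digit first.
    static long computeIndex(String name) {
        String padded = String.format("%-" + NAME_LENGTH + "s", name); // right-pad with spaces
        if (padded.length() != NAME_LENGTH) {
            throw new IllegalArgumentException("name longer than " + NAME_LENGTH + " characters");
        }
        long index = 0;
        for (int i = 0; i < NAME_LENGTH; i++) {
            index = index * BASE + digit(padded.charAt(i));
        }
        return index;
    }

    // Number of distinct index values, 27^10, which is the array length required.
    static long tableLength() {
        long length = 1;
        for (int i = 0; i < NAME_LENGTH; i++) {
            length *= BASE;
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(computeIndex("jane")); // prints 76686786433638
        System.out.println(tableLength());        // prints 205891132094649
    }
}
```

The index has to be a long here: the required length, 27^10, is far larger than the largest int, so a Java array (which is indexed by int) could not even be declared at this size.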
These examples implement sets of data items with O(1) running time for insertions and searches. But there is a problem: the table is too big.
How can this problem be remedied? Answer: Modify the key->index function so that its range is much smaller than its domain.
In the case of the Employee records, we could, for example, form the index by using only the rightmost 3 or 4 digits of the social security number. Then the table only needs to be 1000 or 10000 records long.
In the case of the set of names, we could form the index by using only the first two characters of the name. Then the length of the array only needs to be 27*27 = 729.
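The two reduced index functions might be sketched as follows. The class and method names are hypothetical; the sketch assumes the social security number is stored as a string of digits with no dashes, and that every name has at least two characters drawn from lowercase letters and spaces.

```java
class ReducedIndex {
    // Index from the rightmost 4 digits of the SSN: range 0..9999.
    static int ssnIndex(String ssn) {
        return Integer.parseInt(ssn.substring(ssn.length() - 4));
    }

    // Base-27 digit of a character: ' ' = 0, 'a' = 1, ..., 'z' = 26.
    static int digit(char c) {
        return c == ' ' ? 0 : c - 'a' + 1;
    }

    // Index from the first two characters of the name: range 0..728.
    static int nameIndex(String name) {
        return digit(name.charAt(0)) * 27 + digit(name.charAt(1));
    }

    public static void main(String[] args) {
        System.out.println(ssnIndex("123456789")); // prints 6789
        System.out.println(nameIndex("jane"));     // prints 271
        System.out.println(nameIndex("janet"));    // prints 271
    }
}
```

Note that "jane" and "janet" now map to the same index: shrinking the range means different keys can share a position.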
This is the basic idea of hashing. A hash table is used to implement either a set of keys or a mapping from a set of keys to a set of objects. Ideally, it should work like this:
- Define a function from the set of keys to a range of integers, normally from 0 up to some limit. This is called a hash function.
- Create an array whose length is determined by the range of the hash function. (In the Employee example, the array would be 1000 or 10000 entries long.)
- The put(key, value) method is implemented by the statement "array[key.hash()] = value"
- The get(key) method is implemented by the statement "return array[key.hash()]"
- Both get and put are O(1)
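The idealized scheme in the list above could look like this in Java. This is only a sketch: the class name IdealHashTable is hypothetical, keys are assumed to be non-negative ints, values are Strings, and the hash function simply takes the key modulo the table length.

```java
class IdealHashTable {
    private final String[] array;

    IdealHashTable(int tableSize) {
        array = new String[tableSize];
    }

    // Hash function: maps a non-negative key into the range 0..array.length-1.
    private int hash(int key) {
        return key % array.length;
    }

    // One array store: O(1).
    void put(int key, String value) {
        array[hash(key)] = value;
    }

    // One array load: O(1). Returns null if nothing was stored at that position.
    String get(int key) {
        return array[hash(key)];
    }

    public static void main(String[] args) {
        IdealHashTable table = new IdealHashTable(1000);
        table.put(123456789, "Jane");
        System.out.println(table.get(123456789)); // prints Jane
    }
}
```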
Problem: What if two different keys have the same hash value?
- This is called a "collision"
- Collisions can be reduced, but not eliminated, by a good design of the hash function. (Actually, it may be possible to define a "perfect" (i.e., collision-free) hash function if all the keys are known in advance, but this is not the usual situation.)
- We need a collision-handling strategy.
- Insertion and search algorithms must be modified to deal with collisions.
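To see why a collision-handling strategy is needed, consider this small sketch, in which the naive "array[hash(key)] = value" store silently loses a record. The class name, the sample keys, and the sample values are made up for illustration; the hash function keeps the last three digits of the key.

```java
class CollisionDemo {
    // Hash function: the rightmost three digits of a non-negative key.
    static int hash(int key) {
        return key % 1000;
    }

    // Stores two records whose keys collide, then looks up the first key.
    static String lookupAfterCollision() {
        String[] table = new String[1000];
        table[hash(123456789)] = "Jane";
        table[hash(987654789)] = "Raj";  // same last three digits: overwrites Jane
        return table[hash(123456789)];
    }

    public static void main(String[] args) {
        System.out.println(lookupAfterCollision()); // prints Raj
    }
}
```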
In general, to implement a hash table, two design problems must be solved:
- How to design a hash function which will minimize the chance of collisions
- How to handle the collisions which do occur
Hash Function Design
A hash function is a mapping from a set of keys to a range of integers. The keys can be any objects, although they are most likely to be character strings or integers. The values of the hash function are integers in the designated range.
A hash function should have the following characteristics:
- Easy to compute
- Repeatable
- Should depend on all of the key
- All range values should be hit with equal probability
- Patterns in the key should not be reflected in patterns in the hash values
Division
Suppose that the key is an integer, and that the hash table is an array of tableSize entries. Then define
hash(key) = key % tableSize
This function is easy to compute. Other characteristics depend on the choice of tableSize. Some anomalies may occur in certain situations:
Q: What if tableSize is a power of 2?
Q: What if tableSize is even and all keys are odd?
The best choice for tableSize is a prime number, or a number with no small prime factors. Powers of two are especially to be avoided.
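The two questions above can be explored with a small experiment that counts how many table positions a set of keys actually reaches. This is a sketch with hypothetical names (DivisionHash, slotsUsed, and the two key generators), and it assumes non-negative keys.

```java
class DivisionHash {
    // The division method, for non-negative keys.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Counts the distinct table positions hit by the given keys.
    static int slotsUsed(int[] keys, int tableSize) {
        boolean[] used = new boolean[tableSize];
        int count = 0;
        for (int key : keys) {
            int h = hash(key, tableSize);
            if (!used[h]) {
                used[h] = true;
                count++;
            }
        }
        return count;
    }

    // The first n odd numbers: 1, 3, 5, ...
    static int[] oddKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = 2 * i + 1;
        return keys;
    }

    // The first n multiples of 16: 0, 16, 32, ...
    static int[] multiplesOf16(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = 16 * i;
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(slotsUsed(oddKeys(100), 16));       // prints 8
        System.out.println(slotsUsed(oddKeys(100), 17));       // prints 17
        System.out.println(slotsUsed(multiplesOf16(100), 16)); // prints 1
        System.out.println(slotsUsed(multiplesOf16(100), 17)); // prints 17
    }
}
```

With an even tableSize, odd keys can only reach odd positions, so half the table is wasted. With a tableSize of 2^k, the hash value is just the low k bits of the key, so keys that agree in those bits all collide. The prime tableSize 17 spreads both key sets over the whole table.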
Other methods for computing a hash function include the midsquare method and the multiplication method.
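As an illustration of one of these alternatives, here is a sketch of the multiplication method in its usual textbook form: multiply the key by a constant A between 0 and 1, keep the fractional part of the product, and scale it by tableSize. The constant used here, (sqrt(5) - 1) / 2, is a commonly suggested choice and is an assumption of this sketch, as are the class name and the restriction to non-negative keys.

```java
class MultiplicationHash {
    // A commonly suggested multiplier: (sqrt(5) - 1) / 2, about 0.618.
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;

    // hash(key) = floor(tableSize * fractionalPart(key * A)), for key >= 0.
    static int hash(int key, int tableSize) {
        double fraction = (key * A) % 1.0; // fractional part, in [0, 1)
        return (int) (tableSize * fraction);
    }

    public static void main(String[] args) {
        System.out.println(hash(1, 1000)); // prints 618
        System.out.println(hash(2, 1000)); // prints 236
    }
}
```

Unlike the division method, the quality of this function does not depend on tableSize being prime.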
Hash Codes
Division is effective if the keys are integers. What if the keys are not integers, but some other sort of object? We can interpret any object in memory as an integer, by looking at the string of bits which are used to represent it at the machine level. But the numbers that result may be very large, too large to store in an integer variable, and too large to perform arithmetic operations on.
What we need is a pre-hash function that converts an object to an integer.
We'll consider some ways to do this for strings.
- Take all the characters in the string and add (or xor) together their binary representations.
- Take all the characters in the string and shift them left by different numbers of bits, then add (or xor) the shifted values.
- Interpret the characters in the string as coefficients of a polynomial in one variable, then evaluate the polynomial for some well-chosen value of the variable.
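The three approaches in the list above might be sketched like this. The class and method names are hypothetical, and the shift amounts in the second method (0, 8, 16, or 24 bits, chosen by character position) are just one possible choice.

```java
class StringPrehash {
    // Method 1: add the character codes together.
    static int additive(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Method 2: shift each character left by an amount that depends on its
    // position, then xor the shifted values together.
    static int shiftXor(String s) {
        int result = 0;
        for (int i = 0; i < s.length(); i++) {
            result ^= s.charAt(i) << ((i % 4) * 8);
        }
        return result;
    }

    // Method 3: treat the characters as polynomial coefficients and evaluate
    // the polynomial at x. Int overflow simply wraps around.
    static int polynomial(String s, int x) {
        int result = 0;
        for (int i = 0; i < s.length(); i++) {
            result = result * x + s.charAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(additive("stop") == additive("pots")); // prints true
        System.out.println(shiftXor("ab") == shiftXor("ba"));     // prints false
    }
}
```

The additive method ignores the order of the characters, so anagrams such as "stop" and "pots" always receive the same value. The other two methods make the result depend on character position.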
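In C, a pre-hash function can do this for a key of any type by xor-ing together the machine words that make up the key:
int prehash(sometype key){               /* sometype stands for the key's actual type */
    int result = 0;
    int i;
    int limit = (sizeof key) / sizeof(int);  /* number of whole int-sized words in the key */
    int *p = (int*) &key;                    /* view the key's bytes as an array of ints */
    for (i = 0; i < limit; i++)
        result ^= p[i];                      /* fold each word into the result */
    return result;
}
In Java, we don't have access to the binary representation of an object. However, Java provides a method (defined for all Objects) called "hashCode". For Strings, hashCode uses the method of polynomial interpretation: for a character string consisting of the characters s[0], s[1], s[2],...,s[k], it returns
s[0]*31^k + s[1]*31^(k-1) + s[2]*31^(k-2) + ... + s[k-1]*31^1 + s[k]*31^0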
The result is some 32-bit integer value. This value can then be mapped into the desired range using division, multiplication, or some other hashing technique.
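One detail matters when mapping a hash code into the table range by division: hashCode() may return a negative int, and Java's % operator then gives a negative remainder, which is not a valid array index. A minimal sketch of one way to handle this uses Math.floorMod; the class and method names are hypothetical.

```java
class TableIndex {
    // Maps any object's hash code into the range 0..tableSize-1.
    // Math.floorMod always returns a non-negative result for a positive modulus.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);                // prints -2
        System.out.println(Math.floorMod(-7, 5));  // prints 3
        System.out.println(indexFor("jane", 101)); // some index in 0..100
    }
}
```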
Here is a method that uses Horner's rule to compute this value for a String:
int hashCode(String s){
    int code = 0;
    for(int k=0; k<s.length(); k++)
        code = 31*code + s.charAt(k);  // Horner's rule: multiply by 31, then add the next character
    return code;
}
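As a check, a Horner's rule computation of this polynomial can be compared against the built-in String.hashCode(). The sketch below uses the hypothetical names HornerCheck and hornerHash; each step replaces code with 31*code plus the next character, and int overflow wraps around.

```java
class HornerCheck {
    // Evaluates s[0]*31^k + s[1]*31^(k-1) + ... + s[k]*31^0 by Horner's rule.
    static int hornerHash(String s) {
        int code = 0;
        for (int k = 0; k < s.length(); k++) {
            code = 31 * code + s.charAt(k);
        }
        return code;
    }

    public static void main(String[] args) {
        // For "ab": code goes 0 -> 97 -> 31*97 + 98 = 3105.
        System.out.println(hornerHash("ab"));                                // prints 3105
        System.out.println(hornerHash("hashing") == "hashing".hashCode());   // prints true
    }
}
```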