Collision Handling; Open Addressing
Collision handling
A well-chosen hash function can avoid anomalies which result in an excessive number of collisions, but does not eliminate collisions. We need some method for handling collisions when they occur. We'll consider the following techniques:
- Open addressing
  - Linear probing
  - Double hashing
  - Rehashing
- Chaining
Open addressing
The hash table is an array of (key, value) pairs. The basic idea is that when a (key, value) pair is inserted into the array, and a collision occurs, the entry is simply inserted at an alternative location in the array. Linear probing, double hashing, and rehashing are all different ways of choosing an alternative location. Each one uses a different "probe sequence" to find an empty location in the table.
The simplest probing method is called linear probing. In linear probing, the probe sequence is simply the sequence of consecutive locations, beginning with the hash value of the key. If the end of the table is reached, the probe sequence wraps around and continues at location 0. Only if the table is completely full will the search fail. The logic is as follows:
To insert (key0, value0) into hashTable:
1. Compute the hashValue of key0;
2. Let probe = hashValue;
3. Examine hashTable[probe];
4. If it is empty, store (key0, value0) in hashTable[probe] and return;
5. If it is not empty, compare its key with key0;
   - If they are equal, replace the value in the table with value0 and return;
   - If they are not equal, it means that a collision has occurred, so set probe = (probe + 1) % tableSize, and repeat beginning at step 3;
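The insertion steps above can be sketched in Python roughly as follows. This is only an illustration: the names insert, EMPTY, and hash_fn are assumptions made for the example, and the loop is capped at one pass over the table so that a completely full table raises an error instead of probing forever.

```python
EMPTY = None  # assumed marker for a slot that has never held an entry

def insert(table, key, value, hash_fn=hash):
    """Insert or update (key, value) using linear probing; return the slot used."""
    size = len(table)
    probe = hash_fn(key) % size
    for _ in range(size):                 # at most one full pass over the table
        slot = table[probe]
        if slot is EMPTY:                 # free slot: store the new entry
            table[probe] = (key, value)
            return probe
        if slot[0] == key:                # same key: replace the value
            table[probe] = (key, value)
            return probe
        probe = (probe + 1) % size        # collision: try the next slot, wrapping
    raise RuntimeError("hash table is full")

table = [EMPTY] * 7
insert(table, 3, "a")    # 3 % 7 == 3, stored in slot 3
insert(table, 10, "b")   # 10 % 7 == 3 collides, stored in slot 4
insert(table, 3, "c")    # key 3 already present, value replaced in slot 3
```

Small non-negative integers hash to themselves in Python, which is why keys 3 and 10 both start probing at slot 3 here.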
Question: If a (key, value) pair is stored at an alternative location, will it be possible to retrieve it later?
To answer this question, let's consider how a search would be performed. To search hashTable for key0, we would:
1. Compute the hashValue of key0;
2. Let probe = hashValue;
3. Examine hashTable[probe];
4. If it is empty, the key is not in the table, so return null;
5. If it is not empty, compare its key with key0;
   - If they are equal, we have a match, so return the value from hashTable[probe];
   - If they are not equal, a collision may have occurred, so set probe = (probe + 1) % tableSize, and repeat beginning at step 3;
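A matching sketch of the search steps, again in Python and again with assumed names (search, EMPTY, hash_fn). The sample table is written out by hand to show a state in which key 10 was displaced from slot 3 to slot 4 by an earlier collision with key 3.

```python
EMPTY = None  # assumed marker for a slot that has never held an entry

def search(table, key, hash_fn=hash):
    """Return the value stored for key, or None if the key is not in the table."""
    size = len(table)
    probe = hash_fn(key) % size
    for _ in range(size):                 # stop after one full pass
        slot = table[probe]
        if slot is EMPTY:                 # an empty slot ends the probe sequence
            return None
        if slot[0] == key:                # match found
            return slot[1]
        probe = (probe + 1) % size        # keep following the probe sequence
    return None                           # table full and key absent

# Size-7 table: keys 3 and 10 both hash to slot 3; 10 sits in slot 4.
table = [EMPTY, EMPTY, EMPTY, (3, "a"), (10, "b"), EMPTY, EMPTY]
found = search(table, 10)     # probes slot 3, then slot 4
missing = search(table, 17)   # probes slots 3, 4, 5 and stops at the empty slot
```

So the answer to the question is yes: the search follows the same probe sequence as the insertion, so it reaches the alternative location.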
Question: How can an entry be removed from the table, given its key?
The natural thing to do would be to search the table for the key. If it is not found, the remove operation returns without doing anything. If it is found, simply restore the table slot to an empty state.
But there is a problem with this algorithm if collisions have occurred. Suppose that the following sequence of events occurs:
- (key1, value1) maps to slot k in the table and is inserted there.
- (key2, value2) maps to slot k and, because slot k is already filled, is inserted in slot k+1.
- (key1, value1) is deleted from the table.
- A search is made for key2.
- The hash value of key2 is k, so slot k is examined.
- Slot k is empty, so the search algorithm returns null.
- But key2 is really in the table!
We need to distinguish empty slots from slots at which a deletion has occurred. If the insert algorithm arrives at a deleted slot, a new entry may be inserted there, as if the slot were empty. But if the search algorithm arrives at a deleted slot, it must continue probing, as if the slot were full.
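One way to sketch this distinction in Python is with a separate DELETED marker alongside the empty marker. The class and method names below are assumptions for illustration. One design choice goes slightly beyond the description above: insert remembers the first deleted slot it passes but keeps probing until it reaches an empty slot, so that a key already stored further along the probe sequence is updated rather than duplicated.

```python
EMPTY = None          # slot has never held an entry
DELETED = object()    # slot held an entry that was later removed

class LinearProbingTable:
    def __init__(self, size):
        self.slots = [EMPTY] * size

    def _probes(self, key):
        """Yield each slot index once, starting at the key's hash value."""
        size = len(self.slots)
        start = hash(key) % size
        for i in range(size):
            yield (start + i) % size

    def search(self, key):
        for p in self._probes(key):
            slot = self.slots[p]
            if slot is EMPTY:                 # truly empty: key is absent
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
            # deleted or different key: continue as if the slot were full
        return None

    def insert(self, key, value):
        first_free = None
        for p in self._probes(key):
            slot = self.slots[p]
            if slot is EMPTY:
                if first_free is None:
                    first_free = p
                break                         # key cannot lie beyond an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = p            # reusable, but keep looking for key
            elif slot[0] == key:
                self.slots[p] = (key, value)  # key already present: update
                return
        if first_free is None:
            raise RuntimeError("hash table is full")
        self.slots[first_free] = (key, value)

    def remove(self, key):
        for p in self._probes(key):
            slot = self.slots[p]
            if slot is EMPTY:
                return                        # key not present: nothing to do
            if slot is not DELETED and slot[0] == key:
                self.slots[p] = DELETED       # mark as deleted, not as empty
                return

t = LinearProbingTable(7)
t.insert(3, "a")      # slot 3
t.insert(10, "b")     # collides at slot 3, stored in slot 4
t.remove(3)           # slot 3 becomes DELETED
after_delete = t.search(10)   # still found, because probing continues past slot 3
```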
A problem with linear probing is that the filled slots tend to occur in clusters, and those clusters tend to grow. The result is that the insertion and search algorithms may have to do a lengthy linear search in order to find an empty slot in the table. Quadratic probing, double hashing, and rehashing are techniques used to avoid clustering. All of these differ from linear probing only in the way the probe sequence is generated.
With quadratic probing, the probe sequence is based on some quadratic function. One of the simplest methods is the following:
probe_(i+1) = (probe_i + i + 1) % tableSize; // for i = 0, 1, 2, ...
For example, if the initial probe is 5, subsequent probes would be 6, 8, 11, 15, 20, 26, etc. (In this case probe_i = 0.5*i^2 + 0.5*i + 5, before the modulus is applied.) This tends to spread the probes out in the table, avoiding some of the clustering effects, but is still subject to secondary clustering: If two keys have the same initial hash value, their probe sequences will be identical. There is also another problem: Will the probe sequence hit every slot in the table before repeating?
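A small Python sketch can generate this probe sequence and explore the coverage question. The function name quadratic_probes and the table sizes used below are assumptions chosen for illustration.

```python
def quadratic_probes(initial, table_size, count):
    """Return the first `count` probes: each step adds 1, then 2, then 3, ..."""
    probes = [initial % table_size]
    for i in range(count - 1):
        probes.append((probes[-1] + i + 1) % table_size)
    return probes

example = quadratic_probes(5, 1000, 7)            # [5, 6, 8, 11, 15, 20, 26]
slots_size_7 = set(quadratic_probes(0, 7, 7))     # {0, 1, 3, 6}
slots_size_8 = set(quadratic_probes(0, 8, 8))     # all of 0..7
```

With this particular increment rule, the first 7 probes in a table of size 7 reach only 4 distinct slots, while the first 8 probes in a table of size 8 reach all 8. So whether every slot is hit depends on the table size, and an insertion can fail even though empty slots remain.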
With double hashing, a second hash function is used to generate the probe sequence. The first hash function is used, as before, to determine the initial probe in the table. If there is no collision, the insertion or search terminates without using the second hash function. However, if a collision occurs, the next probe is computed using the formula
probe = (probe + hashValue2) % tableSize;
Keys which collide at the first probe are likely to have different values for the second hash function, and therefore have different probe sequences. Clustering is greatly reduced. However, care must be taken that the probe sequence hits every slot in the table before repeating. (Consider the case where tableSize is 10 and hashValue2 is 5: only two slots are ever probed.) Full coverage is guaranteed if hashValue2 and tableSize have no common prime factors. In particular, it is guaranteed if tableSize is a prime number and hashValue2 is not a multiple of tableSize. In every case hashValue2 must not be 0, or the probe never moves.
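The coverage condition can be checked with a short Python sketch. The function name double_hash_probes is an assumption, and the step argument stands in for hashValue2.

```python
from math import gcd

def double_hash_probes(initial, step, table_size):
    """Return the distinct slots visited, in order, before the sequence repeats."""
    seen = []
    probe = initial % table_size
    while probe not in seen:
        seen.append(probe)
        probe = (probe + step) % table_size
    return seen

bad = double_hash_probes(3, 5, 10)    # [3, 8]: 5 and 10 share the factor 5
good = double_hash_probes(3, 7, 10)   # visits all 10 slots: gcd(7, 10) == 1
```

In general the number of distinct slots visited is tableSize / gcd(hashValue2, tableSize), which equals tableSize exactly when the two values share no common factor.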
With rehashing, a sequence of independent hash functions is needed, so each key produces a completely new probe sequence. If hashValue1 results in a collision, hashValue2 is tried. If hashValue2 results in a collision, hashValue3 is tried, and so on, until an empty slot in the table is found. If we run out of hash functions, the algorithm must revert to another technique like linear probing or double hashing.
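As a rough Python sketch of this idea, the function below tries each hash function in a list and falls back to linear probing when all of them collide. The name rehash_insert and the three toy hash functions are hypothetical, chosen only so that the collisions are easy to follow. They are not meant as good or truly independent hash functions.

```python
def rehash_insert(table, key, value, hash_fns):
    """Try each hash function in turn; fall back to linear probing if all collide."""
    size = len(table)
    probe = 0
    for h in hash_fns:
        probe = h(key) % size
        if table[probe] is None or table[probe][0] == key:
            table[probe] = (key, value)
            return probe
    # Out of hash functions: continue with linear probing from the last probe.
    for _ in range(size):
        probe = (probe + 1) % size
        if table[probe] is None or table[probe][0] == key:
            table[probe] = (key, value)
            return probe
    raise RuntimeError("hash table is full")

# Toy hash functions (assumptions for this example only).
toy_hash_fns = [lambda k: k, lambda k: k // 7, lambda k: 3 * k + 1]

table = [None] * 7
slots = [rehash_insert(table, k, str(k), toy_hash_fns) for k in (3, 10, 17, 24)]
# slots == [3, 1, 2, 4]: key 24 collides under all three functions (slot 3 each
# time) and is placed in slot 4 by the linear-probing fallback.
```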
(Note: The term rehashing is sometimes used to describe a different procedure, which we will discuss in the next class.)