
Hashing

by admin last modified 2005-05-11 18:12

Introduction to Hashing; Hash Function Design

We've seen that with a balanced binary search tree, we can perform insertions and searches in time O(log n).

Q:  Can we do better than this?

Consider access operations on an array:

array[index] = item;
item = array[index];

The running time of these operations, which take advantage of the random access property of main memory, is O(1).

Could we apply this same technique to the implementation of a set or map?

example  Store a set of employee records using an array, indexed by social security number:

class Employee {
    String name;
    String ssn;
    String title;
    ...
}

Employee[] employeeTable = new Employee[1000000000];

example  Store a set of 10-character names in an array by treating each one as a base-27 number.  Each character in the name is a base-27 digit; for example, let ' ' = 0, 'a' =1, 'b'=2, etc.  So the name "jane      " is equivalent to the number

10*27^9 + 1*27^8 + 14*27^7 + 5*27^6 + 0*27^5 + 0*27^4 + 0*27^3 + 0*27^2 + 0*27^1 + 0*27^0

To store a name in the set, let

array[name.computeIndex()] = 1;

(A name is removed from the set by assigning a value of 0 to its position in the array.)
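The notes do not show computeIndex itself. A minimal sketch, assuming lowercase names padded with spaces to 10 characters, might look like the following. The class name NameIndex and the helpers digit and pad are hypothetical, introduced only for illustration.

```java
final class NameIndex {
    static final int NAME_LENGTH = 10;   // names are padded to 10 characters
    static final int BASE = 27;          // ' ' plus 'a'..'z'

    // Maps ' ' to 0 and 'a'..'z' to 1..26, as in the example above.
    static int digit(char c) {
        if (c == ' ') return 0;
        if (c >= 'a' && c <= 'z') return c - 'a' + 1;
        throw new IllegalArgumentException("unsupported character: " + c);
    }

    // Hypothetical helper: pads a short name with trailing spaces.
    static String pad(String name) {
        StringBuilder sb = new StringBuilder(name);
        while (sb.length() < NAME_LENGTH) sb.append(' ');
        return sb.toString();
    }

    // Evaluates the name as a base-27 number, most significant digit first.
    static long computeIndex(String name) {
        if (name.length() != NAME_LENGTH)
            throw new IllegalArgumentException("name must be 10 characters");
        long index = 0;
        for (int i = 0; i < name.length(); i++)
            index = index * BASE + digit(name.charAt(i));
        return index;
    }

    public static void main(String[] args) {
        System.out.println(computeIndex(pad("jane")));
    }
}
```

The result is a long rather than an int, because 10 base-27 digits do not fit in 32 bits. A Java array index must be an int, so this sketch could not index a real array directly.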

Q:  How big is the array?


These examples implement sets of data items with O(1) running time for insertions and searches.  But there is a problem:  the table is too big.

How can this problem be remedied?  Answer:  Modify the key->index function so that its range is much smaller than its domain.

In the case of the Employee records, we could, for example,  form the index by using only the rightmost 3 or 4 digits of the social security number.  Then the table only needs to be 1000 or 10000 records long.

In the case of the set of names, we could form the index by using only the first two characters of the name.  Then the length of the array only needs to be 27*27 = 729.
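The two reduced index functions just described can be sketched as follows. This is an illustration only: it assumes the social security number is stored as a string of digits with no dashes, and that names are lowercase with at least two characters. The class and method names are hypothetical, and the numbers in main are made up.

```java
final class ReducedIndex {
    // Index from the rightmost 4 digits of the SSN string: range 0..9999.
    static int ssnIndex(String ssn) {
        return Integer.parseInt(ssn.substring(ssn.length() - 4));
    }

    // Maps ' ' to 0 and 'a'..'z' to 1..26.
    static int digit(char c) {
        return c == ' ' ? 0 : c - 'a' + 1;
    }

    // Index from the first two characters of the name: range 0..728.
    static int nameIndex(String name) {
        return digit(name.charAt(0)) * 27 + digit(name.charAt(1));
    }

    public static void main(String[] args) {
        System.out.println(ssnIndex("123456789"));   // last four digits: 6789
        System.out.println(nameIndex("jane"));       // 10*27 + 1
        System.out.println(nameIndex("jack"));       // same first two letters as "jane"
    }
}
```

Both functions shrink the table, but different keys can now map to the same index: "jane" and "jack" share an index, as do any two numbers with the same last four digits.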

This is the basic idea of hashing.  A hash table is used to implement either a set of keys or a mapping from a set of keys to a set of objects.  Ideally, it should work like this:
  • Define a function from the set of keys to a range of integers, normally from 0 up to some limit.  This is called a hash function.
  • Create an array whose length is determined by the range of the hash function.  (In the Employee example, the array would be 1000 or 10000 entries long.)
  • The put(key, value) method is implemented by the statement "array[key.hash()] = value"
  • The get(key) method is implemented by the statement "return array[key.hash()]"
  • Both get and put are O(1)
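The idealized scheme in the list above can be sketched in a few lines. This is a deliberately naive illustration, not a usable hash table: it assumes nonnegative integer keys and String values, and the class name NaiveTable and its hash function are hypothetical.

```java
final class NaiveTable {
    private final String[] slots;

    NaiveTable(int size) {
        slots = new String[size];
    }

    // Hypothetical hash function: with size 10000 this keeps the last
    // four decimal digits of a nonnegative key.
    private int hash(int key) {
        return key % slots.length;
    }

    // put and get each do one array access, so both are O(1).
    void put(int key, String value) {
        slots[hash(key)] = value;
    }

    String get(int key) {
        return slots[hash(key)];
    }

    public static void main(String[] args) {
        NaiveTable table = new NaiveTable(10000);
        table.put(123456789, "Ann");
        System.out.println(table.get(123456789));
        table.put(987656789, "Bob");                 // same last four digits
        System.out.println(table.get(123456789));    // Ann's entry was overwritten
    }
}
```

The second put silently replaces the first entry, because both keys hash to the same slot. Nothing in this sketch detects or handles that case.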

Problem:  What if two different keys have the same hash value?
  • This is called a "collision"
  • Collisions can be reduced, but not eliminated, by good design of the hash function.  (Actually, it may be possible to define a "perfect" (i.e., collision-free) hash function if all the keys are known in advance, but this is not the usual situation.)
  • We need a collision-handling strategy.
  • Insertion and search algorithms must be modified to deal with collisions.

In general, to implement a hash table, two design problems must be solved:
  • How to design a hash function which will minimize the chance of collisions
  • How to handle the collisions which do occur


Hash Function Design

A hash function is a mapping from a set of keys to a range of integers.  The keys can be any objects, although they are most likely to be character strings or integers.  The values of the hash function are integers in the designated range.

A hash function should have the following characteristics:
  • Easy to compute
  • Repeatable
  • Should depend on all of the key
  • All range values should be hit with equal probability
  • Patterns in the key should not be reflected in patterns in the hash values

Hash function design was once an active area of research.  The research found that one of the simplest techniques meets these criteria well, provided the table size is chosen carefully, so it's the most frequently used today.  It's known as the division (or division-remainder) method.

Division

Suppose that the key is an integer, and that the hash table is an array of tableSize entries.  Then define

    hash(key) = key % tableSize

This function is easy to compute.  Other characteristics depend on the choice of tableSize.  Some anomalies may occur in certain situations:

Q:    What if tableSize is a power of 2?

Q:    What if tableSize is even and all keys are odd?

The best choice for tableSize is a prime number, or a number with no small prime factors.  Powers of two are especially to be avoided.
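The effect of tableSize can be observed by counting how many distinct slots a patterned set of keys actually reaches. The following is a small experiment sketch, assuming nonnegative integer keys; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class DivisionDemo {
    // The division method, for nonnegative keys.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Counts how many different slots the given keys land in.
    static int slotsUsed(int[] keys, int tableSize) {
        Set<Integer> used = new HashSet<>();
        for (int key : keys)
            used.add(hash(key, tableSize));
        return used.size();
    }

    // Returns the first `count` odd numbers: 1, 3, 5, ...
    static int[] oddKeys(int count) {
        int[] keys = new int[count];
        for (int i = 0; i < count; i++)
            keys[i] = 2 * i + 1;
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = oddKeys(100);
        System.out.println(slotsUsed(keys, 16));   // even table size, odd keys
        System.out.println(slotsUsed(keys, 17));   // prime table size
    }
}
```

With tableSize = 16, odd keys can only reach the odd-numbered slots, so half the table is never used. With the prime size 17, the same keys reach every slot. When tableSize is a power of two, 2^k, the remainder of a nonnegative key is just its low-order k bits, so the higher bits of the key have no influence on the hash value.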

Other methods for computing a hash function include the midsquare method and the multiplication method.

Hash Codes

Division is effective if the keys are integers.  What if the keys are not integers, but some other sort of object?  We can interpret any object in memory as an integer, by looking at the string of bits which are used to represent it at the machine level.  But the numbers that result may be very large, too large to store in an integer variable, and too large to perform arithmetic operations on.

What we need is a pre-hash function that converts an object to an integer.

We'll consider some ways to do this for strings.
  1. Take all the characters in the string and add (or xor) together their binary representations.
  2. Take all the characters in the string and shift them left by different numbers of bits, then add (or xor) the shifted values.
  3. Interpret the characters in the string as coefficients of a polynomial in one variable, then evaluate the polynomial for some well-chosen value of the variable.
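The first two methods can be sketched in Java as follows. The shift amounts used here (0, 8, 16, or 24 bits, depending on position) are an arbitrary choice made for illustration, and the class and method names are hypothetical. The polynomial method is left out of this sketch.

```java
final class StringPrehash {
    // Method 1: add the character codes together.
    static int addChars(String s) {
        int result = 0;
        for (int i = 0; i < s.length(); i++)
            result += s.charAt(i);
        return result;
    }

    // Method 2: shift each character left by an amount that depends on
    // its position, then xor the shifted values together.
    static int shiftXor(String s) {
        int result = 0;
        for (int i = 0; i < s.length(); i++)
            result ^= s.charAt(i) << ((i % 4) * 8);
        return result;
    }

    public static void main(String[] args) {
        // Anagrams contain the same characters, so their sums are equal.
        System.out.println(addChars("listen") == addChars("silent"));
        // Shifting by position makes the order of the characters matter.
        System.out.println(shiftXor("ab") == shiftXor("ba"));
    }
}
```

Plain addition ignores the order of the characters, so any two anagrams collide. Shifting by position is one way to make the result depend on where each character appears.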
In C, we can get at the binary representation of the key and do something like this:
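/* sometype is a placeholder for the key's actual type */
int prehash(sometype key){
  int result = 0;
  size_t i;
  size_t limit = sizeof key / sizeof(int);  /* whole ints in the key; leftover bytes are ignored */
  int *p = (int*) &key;                     /* view the key's bytes as an array of ints */
  for(i = 0; i < limit; i++)
    result ^= p[i];                         /* fold each int into the result */
  return result;
}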
In Java, we don't have access to the binary representation of an object.  However, Java provides a method (defined for all Objects) called "hashCode", which returns an int.  For the String class, hashCode uses the method of polynomial interpretation.  For a character string consisting of the characters s[0], s[1], s[2],...,s[k], it returns

s[0]*31^k + s[1]*31^(k-1) + s[2]*31^(k-2) + ... + s[k-1]*31^1 + s[k]*31^0

The result is some 32-bit integer value.  This value can then be mapped into the desired range using division, multiplication, or some other hashing technique.
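One detail matters when the division method is applied to a hash code in Java: hashCode may return a negative int, and Java's % operator then gives a negative remainder, which is not a valid array index. A minimal sketch of one way to handle this uses Math.floorMod from the standard library; the class and method names are hypothetical.

```java
final class TableIndex {
    // Maps any 32-bit hash code into the range 0..tableSize-1.
    // Math.floorMod returns a nonnegative result for a positive divisor,
    // whereas hashCode % tableSize is negative for negative hash codes.
    static int indexFor(int hashCode, int tableSize) {
        return Math.floorMod(hashCode, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);             // -2: unusable as an index
        System.out.println(indexFor(-7, 5));    // 3
        System.out.println(indexFor("jane".hashCode(), 101));
    }
}
```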

Here is a method that uses Horner's rule to compute this value for a String:
int hashCode(String s){
    int code = 0;
    // Horner's rule: multiply the running value by 31, then add the next character.
    for(int k=0; k<s.length(); k++)
        code = 31*code + s.charAt(k);
    return code;
}
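As a check on the polynomial formula, the Horner's rule evaluation can be compared with the value returned by the built-in String.hashCode. The following sketch uses a hypothetical class name and a hypothetical method name, hornerHash, to avoid any confusion with the library method.

```java
final class HornerCheck {
    // Evaluates s[0]*31^k + ... + s[k]*31^0 by Horner's rule.
    // int arithmetic wraps around on overflow, which is intended here.
    static int hornerHash(String s) {
        int code = 0;
        for (int k = 0; k < s.length(); k++)
            code = 31 * code + s.charAt(k);
        return code;
    }

    public static void main(String[] args) {
        // "abc": ((97*31) + 98)*31 + 99
        System.out.println(hornerHash("abc"));
        System.out.println("abc".hashCode());
    }
}
```

For "abc" the steps are 97, then 97*31 + 98 = 3105, then 3105*31 + 99 = 96354. Horner's rule needs only one multiplication and one addition per character, and no explicit powers of 31.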

 
