Personal tools
You are here: Home Classes Fall 2004 - Spring 2005 CS 151 Lab 11 -- Hash Table
Navigation
Log in


Forgot your password?
« May 2008 »
Su Mo Tu We Th Fr Sa
123
456789 10
11121314151617
18192021222324
25262728293031
 
Document Actions

Lab 11 -- Hash Table

by admin last modified 2005-05-11 18:13

Lab 11 Finding Genes with a Hash Table

The objective of today's lab is to write your own implementation of the Map interface using a hash table with separate chaining and use it to write a biological application program.

part one.  Write the MyHashMap class.

Write an implementation of a hash table using separate chaining.  Call your class MyMap.java.  The class should conform to a simplified version of java's Map interface:  SimpleMap.java.  It contains only the get and put methods from Map.

MyMap should have an array of linked lists as a data member.  You may use the LinkedList class from java.util.  It should also keep a count of the number of keys in the table, so that the load factor can be computed.  Give the array an initial size of 53 slots.

You'll need a hash function.  Compute the hash function of a key by calling its hashCode method (which is defined for all Java objects), then modding by the table size (i.e., the length of the array of linked lists), and then taking the absolute value (use Math.abs).  The hash function can be written as a private method within MyMap.

The linked lists will hold (key, value) pairs.  Define a class called MapEntry which will represent one (key, value) pair.  This could be written as an inner class within MyMap.

In addition to the required get and put methods, write a private method called rehash which rehashes the contents of the map to a new array.  rehash should be called by the put method when the load factor of the table exceeds some fixed threshhold.  When a rehash is needed, let the new table size = 2*(old size) + 1.  A value between 0.5 and 2.0 should work well for the maximum load factor; you might want to do some experimenting here to make a good choice.  Whenever a put operation occurs which would cause this maximum to be exceeded, perform a rehash operation.

Use the equals method to compare keys.  It should not be necessary for the keys to implement Comparable.


part two.  Write a map application.

In this problem, you will write an application program using MyMap.  The program will do a simple search for known signals in DNA sequences.  These "signals" are short subsequences (from the alphabet of nucleotides: A, C, G, and T) that are biologically important because some part of the machinery of the cell recognizes them and binds to them.  In particular, you will be looking for signals called transcription factor binding sites (TFBS); a TFBS is a short sequence that occurs in the "promoter region" adjacent to a particular gene and influences the rate at which that gene produces proteins.  (Don't worry -- you don't need to understand the biology background in order to write the program.)  You will write a program to locate all of the (known) TFBS's in yeast.  You will be given a set of known TFBS sequences, as well as the relevant "promoter" sequences for each yeast gene (yeast has about 6000 genes); you will read both files and set up the appropriate hash tables, then scan through the promoter sequences to determine which TFBS's occur in each sequence.  As you will see below, you will need to set up two hash tables:
  1. a hash table for the TFBS's, where the key to be hashed is the actual TFBS sequence and the value is the TFBS identifier.
  2. a hash table that stores the promoter sequences, where the key to be hashed is the (gene) sequence identifier and the value is the promoter sequence.

The file tfbs.fasta contains the sequences of several hundred known TFBS's in yeast. The file yeast-promoters.fasta contains the DNA sequences of the regions next to each of the genes in yeast.  Both files are in FASTA format, which is a file format used for storing biological sequences.  The format is very simple.  Every entry consists of a sequence identifier and a sequence. The format for a single sequence looks like this:

>YAL001C
CTGTACCACTATAATAATTTATCTTGATCGTATTATGTGTAAAAAAAAGGCGCTTGAAAT
GAAAGCTCCGAAAATTAAAATACTTTGACTGCTTCGGAAAACAAAAACATATAAATAAAT
TTAAAAAATAAACTGTAAAATATTTAAAAACTATTAAAAATATTTTATATTTTTAAAATT
The special character ">" marks the beginning of a new sequence.  The ">" character is followed immediately by the sequence identifier.  The rest of that line is empty. Subsequent lines contain the sequence itself.  Some rules about representing sequences:
  • Case doesn't matter.
  • Spaces and newlines should be ignored.
  • The only allowed characters are A, C, G, and T.

For convenience in this program, all TFBS's are of length 8 (though the full set of known TFBS's vary in length).  To simplify the problem, you can hardcode this length in your program as a constant:

  public static final int TFBS_LENGTH = 8;

Your main program should have two command line arguments:  the name of the tfbs file and the name of the promoter sequence file.  The program should read both  files using a BufferedReader, set up the hash tables, and then enter an interactive loop which reads queries from the user from System.in.  When the user enters the identifier for a gene, the program should look up that identifier in the promoter table to find the corresponding promoter sequence.  It should then scan through the  sequence and search for TFBS's by performing a lookup in the TFBS table on every contiguous subsequence ("window") of length TFBS_LENGTH.  (This search approach is called a "sliding window".)  Print out the identifier and location of every TFBS found.

Notes:  

1.  You can read from System.in by wrapping it in a BufferedReader as follows:

BufferedReader inReader = new BufferedReader(new InputStreamReader(System.in)):

Then use the readLine method to read keyboard input, one line at a time.

2.  A good way to read the fasta files is to write a dna sequence class representing an (id, dna sequence) pair and a fasta reader class which has a method to read one dna sequence from a given buffered reader and return it to the caller.

3.  Test your program first using HashMap from java.util.  When it is working, switch over to MyMap.



part three.  Try some other Map implementations.

We know that polymorphism in Java allows a SimpleMap variable to hold an object of any class which implements the SimpleMap interface.  Try this out by adding an optional third parameter to the gene finding application which is the name of the SimpleMap class which should be used by the application.  When the application instantiates its maps, it should instantiate the requested class and store the map object in a SimpleMap variable.  (If the parameter is missing, instantiate MyMap.)

A class can be instantiated even if your program only knows the name of the class as a String, using the Class class.  The static method Class.forName(String className) returns a Class object representing the class identified by the className string.  You can then call the Class object's newInstance() method to return an Object which is an instance of the class.

So you can write something like this:

SimpleMap map = (SimpleMap) (Class.forName(className)).newInstance();

(Note:  This statement can throw a variety of exceptions which must either be caught or thrown.)

Try your program with MyMap and the two Map implementations found in java.util, TreeMap and HashMap.  You will need to extend TreeMap and HashMap to subclasses (say, SimpleTreeMap and SimpleHashMap) which implement SimpleMap.  The bodies of these two subclasses can be empty; TreeMap and HashMap already contain the get and put methods needed to implement SimpleMap.


part four (optional).  Make a TrieMap.

Modify the Trie from lab 9 to implement the SimpleMap interface:
  1. Add a value field to each node in the Trie.
  2. If a word ends at a certain node, then that node should contain the word's value.
  3. Write the get and put methods to implement the interface.
Then run the gene finding application, using TrieMap to build the tables.


 

Powered by Plone CMS, the Open Source Content Management System

This site conforms to the following standards: