10:00pm, Sunday, April 10
You may work with a partner on this assignment.
For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. A query consists of a list of words and phrases to search for. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:
java ProcessQueries urlListFile [count]
A URL can be written in many different formats; see the documentation for java.net.URL for the details of those supported.
An example urlListFile such as urls-profs might contain:
http://www.cs.oberlin.edu/faculty
http://www.cs.oberlin.edu/~kuperman/
http://www.cs.oberlin.edu/~rms/
http://www.cs.oberlin.edu/~asharp/
http://www.cs.oberlin.edu/~rhoyle/
http://www.cs.oberlin.edu/~wexler/
http://www.cs.oberlin.edu/~bob/
http://www.cs.oberlin.edu/~taw/
You should already have a class WebPageIndex from the previous lab that allows you to represent the index of words on a given webpage. It should also store the URL that the index was constructed from, so it can be displayed to the user. Your program will need to go through the urlListFile and attempt to create a WebPageIndex for each of the items listed there. You should include a line in your output that says how many pages you were able to open and form WebPageIndex objects for, and how many pages threw exceptions.
Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URLs that match the query, in order from best match to worst match. Include each result URL's priority in parentheses with each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list. In effect, you are performing a Google-like search of the given query on a restricted subset of pages (the ones in your urlListFile). Wow!
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-oberlinreview.org 10
Fetched: 994 Errors: 811 out of 1805
Enter a query on one line or -1 to quit
Search for: computer science
Relevant pages:
(priority = -11) http://www.oberlinreview.org/article/editorial-athletes-vs-mathletes/
(priority = -9) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
(priority = -8) http://www.oberlinreview.org/thisweek/2011/4/29/
(priority = -4) http://www.oberlinreview.org/article/luminary-jaron-lanier-unites-digital-media-music-d/
(priority = -2) http://www.oberlinreview.org/article/timara-program-produces-striking-unusual-cinema/
(priority = -1) http://www.oberlinreview.org/article/In-The-Locker-Room/
(priority = -1) http://www.oberlinreview.org/article/editorial-mugging-shouldnt-fray-town-gown-connecti/
(priority = -1) http://www.oberlinreview.org/article/friedman-eschews-centrism-embraces-partisanship/
(priority = -1) http://www.oberlinreview.org/article/new-age-artist-iasos-transfixes-concertgoers/
(priority = -1) http://www.oberlinreview.org/article/frank-lloyd-wright-house-hosts-multimedia-extravag/
Search for: "computer science"
Relevant pages:
(priority = -3) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
Search for: -1
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-cs 6
Fetched: 186 Errors: 4 out of 190
Enter a query on one line or -1 to quit
Search for: csci
Relevant pages:
(priority = -17) http://occs.cs.oberlin.edu/classes/
(priority = -17) http://occs.cs.oberlin.edu/classes/
(priority = -10) http://occs.cs.oberlin.edu/category/events/
(priority = -9) http://occs.cs.oberlin.edu/
(priority = -9) http://cs.oberlin.edu/~ctaylor/
(priority = -7) http://occs.cs.oberlin.edu/category/jobsinternships/
Search for: data structures
Relevant pages:
(priority = -6) http://occs.cs.oberlin.edu/category/class/
(priority = -4) http://occs.cs.oberlin.edu/classes/
(priority = -4) http://occs.cs.oberlin.edu/classes/
(priority = -4) http://occs.cs.oberlin.edu/2012/11/
(priority = -4) http://occs.cs.oberlin.edu/classes/electives-schedule/
(priority = -4) http://occs.cs.oberlin.edu/2012/11/new-course-social-networks/
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-catalog 5
Fetched: 357 Errors: 2 out of 359
Enter a query on one line or -1 to quit
Search for: "computer science"
Relevant pages:
(priority = -9) http://catalog.oberlin.edu/content.php?catoid=32&navoid=708
(priority = -1) http://catalog.oberlin.edu/content.php?catoid=32&navoid=710
(priority = -1) http://catalog.oberlin.edu/content.php?catoid=32&navoid=703
Search for: extremely difficult class
Relevant pages:
(priority = -16) http://catalog.oberlin.edu/content.php?catoid=32&navoid=723
(priority = -12) http://catalog.oberlin.edu/content.php?catoid=32&navoid=713
(priority = -9) http://catalog.oberlin.edu/content.php?catoid=32&navoid=708
(priority = -7) http://catalog.oberlin.edu/content.php?catoid=32&navoid=703
(priority = -6) http://catalog.oberlin.edu/preview_course_nopop.php?catoid=32&coid=67600
To find the results of the query in order, construct a priority queue of WebPageIndex objects, one per web page in the urlListFile. The priority value should initially be computed by adding the counts of the words and phrases in the query. The priority queue can be used to print out the matching URLs in order.
To compare WebPageIndex objects, use a Comparator. Comparator<E> is a Java interface which contains the method

    int compare(E item1, E item2)

which returns a negative integer if item1 is less than item2, a positive integer if item1 is greater than item2, and zero if the two are considered equal.
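To make that contract concrete, here is a small hypothetical Comparator (this is not the provided StringComparator, just an illustration) that orders Strings by length:

```java
import java.util.Comparator;

// Illustrative only: orders Strings by length. compare() returns a
// negative number when a should come before b, zero when the two are
// equivalent, and a positive number otherwise.
public class StringLengthComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        return Integer.compare(a.length(), b.length()); // shorter strings first
    }
}
```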
The lab zip file contains a sample Comparator called StringComparator.java. You need to write your own Comparator to compare WebPageIndex objects, based on the current query.
What you need from the previous Lab:
There is also a jar file containing working versions of these classes that you may elect to use instead.
What you are given in the lab7.zip file:
% javac -classpath csci151lab6.jar:. ProcessQueries.java
% java -classpath csci151lab6.jar:. ProcessQueries urls-profs
What you need to write:
Write a heap-based implementation of a PriorityQueue, extending AbstractQueue. It should contain an array or ArrayList to hold the data items in the priority queue and a Comparator to compare the relevance of two web pages to a given query. In addition to the interface methods, it needs to contain at least one constructor, presumably one that takes a Comparator as input. Remember that this is supposed to be a min-heap, so the smallest value is at the top.
There is a skeleton MyPriorityQueue java file provided for you in the zip file, if you want to use it. It is not much more than what Eclipse would provide you, so it is up to you whether you start from scratch or from this file.
Don't forget to test each method you write, ideally as you write it with JUnit tests before moving on to the next part.
You will need the following public and private methods. In particular, after removing the minimum element, poll() should call a private percolateDown() method on the root after rearranging things. Use your Comparator field (here called cmp) whenever two items must be compared, for example when deciding which child to follow as the hole moves down through the heap.
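As a sketch, a min-heap priority queue along these lines might look like the following. All names besides the required AbstractQueue methods are illustrative, and your skeleton file may organize things differently:

```java
import java.util.AbstractQueue;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;

// One possible shape for MyPriorityQueue: a min-heap backed by an
// ArrayList, so the smallest element (per the Comparator) sits at index 0.
public class MyPriorityQueue<E> extends AbstractQueue<E> {
    private ArrayList<E> heap = new ArrayList<>();
    private Comparator<E> cmp;

    public MyPriorityQueue(Comparator<E> cmp) {
        this.cmp = cmp;
    }

    @Override
    public boolean offer(E item) {
        heap.add(item);                  // place the new item in the last slot
        percolateUp(heap.size() - 1);    // then restore the heap property
        return true;
    }

    @Override
    public E poll() {
        if (heap.isEmpty()) return null;
        E min = heap.get(0);
        E last = heap.remove(heap.size() - 1);
        if (!heap.isEmpty()) {
            heap.set(0, last);           // move the last leaf to the root
            percolateDown(0);            // and sift it back down
        }
        return min;
    }

    @Override
    public E peek() {
        return heap.isEmpty() ? null : heap.get(0);
    }

    @Override
    public int size() { return heap.size(); }

    @Override
    public Iterator<E> iterator() { return heap.iterator(); }

    private void percolateUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (cmp.compare(heap.get(i), heap.get(parent)) >= 0) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void percolateDown(int i) {
        int n = heap.size();
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;       // left child
            if (child + 1 < n && cmp.compare(heap.get(child + 1), heap.get(child)) < 0)
                child++;                 // pick the smaller of the two children
            if (cmp.compare(heap.get(child), heap.get(i)) >= 0) break;
            swap(i, child);
            i = child;
        }
    }

    private void swap(int i, int j) {
        E tmp = heap.get(i);
        heap.set(i, heap.get(j));
        heap.set(j, tmp);
    }
}
```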
Don't forget to write JUnit tests as you go along, to test your priority queue. Create MyPriorityQueueTest.java and work on the tests as you implement the various methods. You can check the behaviour of MyPriorityQueue with that of Java's PriorityQueue. Since your priority queue requires a Comparator in order to construct it, you may want to use the provided StringComparator, and make priority queues out of Strings.
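One way to structure such tests is to cross-check removal order against a sorted oracle. The sketch below uses java.util.PriorityQueue as the queue under test so that it runs on its own; in MyPriorityQueueTest you would build a MyPriorityQueue (for example with the provided StringComparator) instead, and wrap each check in a JUnit @Test method:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.PriorityQueue;
import java.util.Random;

// Cross-checking sketch: feed the same random strings to the queue under
// test and to a sorted-list oracle, then verify the removal order matches.
public class CrossCheck {
    public static boolean removalOrderMatchesSorted(long seed, int n) {
        Random rng = new Random(seed);
        PriorityQueue<String> pq = new PriorityQueue<>(String::compareTo);
        ArrayList<String> oracle = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String s = Integer.toString(rng.nextInt(1000));
            pq.offer(s);
            oracle.add(s);
        }
        Collections.sort(oracle);        // expected removal order for a min-heap
        for (String expected : oracle) {
            if (!expected.equals(pq.poll())) return false;
        }
        return pq.isEmpty();
    }
}
```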
In this section of the lab you will write a Comparator to compare WebPageIndex objects, based on the current query. Recall that Comparator<E> is a Java interface which contains the method

    int compare(E item1, E item2)

which returns a negative integer if item1 is less than item2, a positive integer if item1 is greater than item2, and zero if the two are considered equal.
Once you understand StringComparator, write your own comparator, URLComparator, that compares two WebPageIndex objects based on their relevance to a given query (indicated as a parameter to the constructor).
You should include a method (possibly a public one...) that allows you to compute a score for a given WebPageIndex object using the current query. To remind you, the score is the sum of the word counts of each word (and phrase) in the query.
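A sketch of that scoring method, assuming WebPageIndex exposes a per-word count lookup (getCount() is an assumed name here; the tiny stub class below only stands in for your real WebPageIndex so the example runs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Score sketch: the score is the sum of each query term's count in the page.
// getCount() is an assumed method name; use whatever your WebPageIndex
// actually provides.
public class ScoreDemo {
    static class WebPageIndexStub {
        Map<String, Integer> counts = new HashMap<>();
        int getCount(String word) { return counts.getOrDefault(word, 0); }
    }

    // Higher scores are better matches. Negating the score makes a min-heap
    // yield the best match first, which is one way to get the negative
    // priorities shown in the sample output.
    public static int score(WebPageIndexStub page, List<String> queryTerms) {
        int total = 0;
        for (String term : queryTerms) {
            total += page.getCount(term);
        }
        return total;
    }
}
```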
Test your Comparator with JUnit tests before proceeding!
Note: If your URLComparator class is not "recognizing" WebPageIndex, it is probably because you declared your URLComparator class incorrectly. You should use a declaration of the form public class URLComparator implements Comparator<WebPageIndex>.
This class contains the main method of the application. The program has two basic parts. The first part is to build a list of WebPageIndex objects from the URLs listed in the urlFileList. The second part is to enter a loop to process a series of user queries. Create a class ProcessQueries whose main method will have this functionality.
First, implement the part of your program that processes the urlListFile. For each URL read in, use that URL to construct a WebPageIndex. Put all of the WebPageIndex objects in a list. You should do this as a method and not just have all the code in main.
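One possible shape for that method is sketched below. To keep the example self-contained and offline, buildIndex() merely validates the URL string; in your program it would construct and return a WebPageIndex instead:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Loading sketch: read each URL string from the list file, try to build an
// index for it, and keep a fetched/error tally. In your program the list
// would hold WebPageIndex objects rather than bare URLs.
public class UrlLoader {
    public static List<URL> loadAll(String urlListFile) throws FileNotFoundException {
        List<URL> pages = new ArrayList<>();
        int errors = 0;
        Scanner in = new Scanner(new File(urlListFile));
        while (in.hasNext()) {
            String spec = in.next();
            try {
                pages.add(buildIndex(spec));
            } catch (Exception e) {      // bad URL or failed fetch: skip it
                errors++;
            }
        }
        in.close();                      // release the file descriptor
        System.out.println("Fetched: " + pages.size() + " Errors: " + errors
                + " out of " + (pages.size() + errors));
        return pages;
    }

    // Stand-in for new WebPageIndex(spec): only checks the URL is well formed.
    static URL buildIndex(String spec) throws MalformedURLException {
        return new URL(spec);
    }
}
```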
Second, you need to process the user queries. To find the results of the query in best-to-worst order, construct a priority queue of WebPageIndex objects, one per web page in the urlListFile. The priority value should be computed by adding the counts of the words and phrases in the query. The priority queue can be used to print out the matching URLs in order.
For every subsequent search query, use the new query to construct a comparator, and then reheap your collection of WebPageIndex objects using that comparator.
You should also support searching for phrases contained within double quotes. String objects have a number of methods like startsWith(), endsWith(), and substring() that I found to be useful when constructing a phrase. You can wait and add this in at the end once you have everything else working if you'd like.
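One way to parse such queries, as a sketch: read whitespace-separated tokens, and when a token starts with a double quote, keep appending tokens until the closing quote appears:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Query-parsing sketch: split a query into terms, gluing tokens between
// double quotes back together into a single phrase. startsWith(),
// endsWith(), and substring() do the work, as the handout suggests.
public class QueryParser {
    public static List<String> parse(String query) {
        List<String> terms = new ArrayList<>();
        Scanner tok = new Scanner(query);
        while (tok.hasNext()) {
            String word = tok.next();
            if (word.startsWith("\"")) {
                StringBuilder phrase = new StringBuilder(word.substring(1));
                // keep appending tokens until one ends with the closing quote
                while (!phrase.toString().endsWith("\"") && tok.hasNext()) {
                    phrase.append(" ").append(tok.next());
                }
                String p = phrase.toString();
                if (p.endsWith("\"")) p = p.substring(0, p.length() - 1);
                terms.add(p);
            } else {
                terms.add(word);
            }
        }
        tok.close();
        return terms;
    }
}
```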
The program then prints out the matching URLs in order from best to worst match until there are no matches or the user specified limit is reached. Continue reading and processing queries in a loop until you reach end of file on System.in or some designated terminator string (e.g., "-1").
Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WebPageIndex for all three words to determine the URL's priority.
Don't "drop" the results as you are pulling them out of the heap. Just stick them in a list of some sort as you remove them and add them back in afterwards.
When you change a comparator in the heap, you need to reheapify things. You could just create a new ArrayList and add in all the old items. A better choice would be to reheapify using the linear time algorithm discussed in class.
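A sketch of the linear-time build-heap idea: percolate down from the last internal node back to the root. This assumes the usual array layout where the children of index i live at 2i+1 and 2i+2:

```java
import java.util.Comparator;
import java.util.List;

// Reheapify sketch: after swapping in a comparator for the new query, the
// heap can be rebuilt in O(n) by percolating down every internal node,
// instead of re-offering all n items at O(n log n).
public class Heapify {
    public static <E> void buildHeap(List<E> heap, Comparator<E> cmp) {
        for (int i = heap.size() / 2 - 1; i >= 0; i--) {
            percolateDown(heap, cmp, i);
        }
    }

    private static <E> void percolateDown(List<E> heap, Comparator<E> cmp, int i) {
        int n = heap.size();
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && cmp.compare(heap.get(child + 1), heap.get(child)) < 0)
                child++;                 // smaller of the two children
            if (cmp.compare(heap.get(child), heap.get(i)) >= 0) break;
            E tmp = heap.get(i);
            heap.set(i, heap.get(child));
            heap.set(child, tmp);
            i = child;
        }
    }
}
```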
Your output should include each page's score / priority. We deliberately haven't said how to do this, and there are a couple of different approaches that will work. We hope you can find a way to solve this on your own, but if you are stuck on how to get those priorities, you can ask and we'll point you in the right direction.
You might run into a message that indicates that you have too many files open. Every HTMLScanner you create (or Scanner to read a file) uses one of a limited number of file descriptor slots. You should get rid of the reference to either the Scanner or the HTMLScanner when you are done using it. If you keep them in your WebPageIndex class, you will run out of descriptors to use.
You might also run out of memory on a large url-list file if you run Java in the default manner (it only allocates 64MB of RAM). You can increase the amount of memory given to the Java Virtual Machine with an option to the java command. For example,
java -Xmx1g ProcessQueries urls-all 10
where the -Xmx indicates that you want it to use more memory, and the 1g indicates to use up to 1 gigabyte.
When you are dealing with a large number of urls, you might want to print out a status message every N items just to let yourself know that things are still progressing. You can be extra fancy if you use "\r" in a System.out.print() statement. Try out the following which you can modify for your own purposes (be sure to take out the sleep):
public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < 10; i++) {
        Thread.sleep(500);
        System.out.print("\rCounting up to " + i);
    }
    System.out.println();
}
Once you have your program working correctly, you may have noticed that it takes a while for it to load all of the URLs at the start. There are a couple of ways to address this. One is to cache the results from your fetching -- you could do that by writing out your WebPageIndex objects to disk. However, another way would be to fetch multiple pages simultaneously.
The class WebPageLoader is designed to do just that. If you give it a list of URLs and a number, it will fetch that number of pages simultaneously. This class is still experimental, and I found that you might get more errors when using it than you do from fetching pages sequentially.
Once your ProcessQueries program is working correctly, you are welcome to try using this class to see if it speeds things up. You should comment out your existing method that creates all of the WebPageIndex objects (you did do that as a method, didn't you?) and add in a call to a new method that uses WebPageLoader to do the fetching. We may need to be able to test your program with the sequential fetching, so leave all of that code in, just call a different method.
You'll want to be careful to not set it up to make too many parallel requests. I found that 5-10 works well, but something like 20 often resulted in many more failed page loads than before.
Use the handin program to submit a directory containing
If you work with a partner, please only one of you submit your joint solution using handin.
Here are a few suggestions as to how you might improve your search engine.
If you try any of these, or come up with another technique, write something in your README file to describe what you did, how difficult it was, and how well (if at all) it improved your search results.
MyPriorityQueue      [/15]
MyPriorityQueueTest  [/5]
URLComparator        [/4]
URLComparatorTests   [/2]
ProcessQueries       [/20]
README               [/2]
Javadocs             [/2]
TOTAL:               [/50]