10:00pm, Sunday, April 10
You may work with a partner on this assignment.
For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. A query consists of a list of words and phrases to search for. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:
java ProcessQueries urlListFile [count]
A URL can be written in many different formats; see the documentation for java.net.URL for the details of those supported.
An example urlListFile such as urls-profs might contain:
http://www.cs.oberlin.edu/faculty
http://www.cs.oberlin.edu/~kuperman/
http://www.cs.oberlin.edu/~rms/
http://www.cs.oberlin.edu/~asharp/
http://www.cs.oberlin.edu/~rhoyle/
http://www.cs.oberlin.edu/~wexler/
http://www.cs.oberlin.edu/~bob/
http://www.cs.oberlin.edu/~taw/
You should already have a class WebPageIndex from the previous lab that allows you to represent the index of words on a given webpage. It should also store the URL that the index was constructed from, so it can be displayed to the user. Your program will need to go through the urlListFile and attempt to create a WebPageIndex for each of the items listed there. You should include a line in your output that says how many pages you were able to open and form WebPageIndex objects for, and how many pages threw exceptions.
Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URLs that match the query, in order from best match to worst match. Include each result URL's priority in parentheses with each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list. In effect, you are performing a Google-like search of the given query on a restricted subset of pages (the ones in your urlListFile). Wow!
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-oberlinreview.org 10
Fetched: 994 Errors: 811 out of 1805
Enter a query on one line or -1 to quit
Search for: computer science
Relevant pages:
(priority = -11) http://www.oberlinreview.org/article/editorial-athletes-vs-mathletes/
(priority = -9) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
(priority = -8) http://www.oberlinreview.org/thisweek/2011/4/29/
(priority = -4) http://www.oberlinreview.org/article/luminary-jaron-lanier-unites-digital-media-music-d/
(priority = -2) http://www.oberlinreview.org/article/timara-program-produces-striking-unusual-cinema/
(priority = -1) http://www.oberlinreview.org/article/In-The-Locker-Room/
(priority = -1) http://www.oberlinreview.org/article/editorial-mugging-shouldnt-fray-town-gown-connecti/
(priority = -1) http://www.oberlinreview.org/article/friedman-eschews-centrism-embraces-partisanship/
(priority = -1) http://www.oberlinreview.org/article/new-age-artist-iasos-transfixes-concertgoers/
(priority = -1) http://www.oberlinreview.org/article/frank-lloyd-wright-house-hosts-multimedia-extravag/
Search for: "computer science"
Relevant pages:
(priority = -3) http://www.oberlinreview.org/article/visiting-speaker-inspects-myspace-friendships/
Search for: -1
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-cs 6
Fetched: 186 Errors: 4 out of 190
Enter a query on one line or -1 to quit
Search for: csci
Relevant pages:
(priority = -17) http://occs.cs.oberlin.edu/classes/
(priority = -17) http://occs.cs.oberlin.edu/classes/
(priority = -10) http://occs.cs.oberlin.edu/category/events/
(priority = -9) http://occs.cs.oberlin.edu/
(priority = -9) http://cs.oberlin.edu/~ctaylor/
(priority = -7) http://occs.cs.oberlin.edu/category/jobsinternships/
Search for: data structures
Relevant pages:
(priority = -6) http://occs.cs.oberlin.edu/category/class/
(priority = -4) http://occs.cs.oberlin.edu/classes/
(priority = -4) http://occs.cs.oberlin.edu/classes/
(priority = -4) http://occs.cs.oberlin.edu/2012/11/
(priority = -4) http://occs.cs.oberlin.edu/classes/electives-schedule/
(priority = -4) http://occs.cs.oberlin.edu/2012/11/new-course-social-networks/
% java -Xmx4g -classpath jsoup-1.8.3.jar:. ProcessQueries urls-catalog 5
Fetched: 357 Errors: 2 out of 359
Enter a query on one line or -1 to quit
Search for: "computer science"
Relevant pages:
(priority = -9) http://catalog.oberlin.edu/content.php?catoid=32&navoid=708
(priority = -1) http://catalog.oberlin.edu/content.php?catoid=32&navoid=710
(priority = -1) http://catalog.oberlin.edu/content.php?catoid=32&navoid=703
Search for: extremely difficult class
Relevant pages:
(priority = -16) http://catalog.oberlin.edu/content.php?catoid=32&navoid=723
(priority = -12) http://catalog.oberlin.edu/content.php?catoid=32&navoid=713
(priority = -9) http://catalog.oberlin.edu/content.php?catoid=32&navoid=708
(priority = -7) http://catalog.oberlin.edu/content.php?catoid=32&navoid=703
(priority = -6) http://catalog.oberlin.edu/preview_course_nopop.php?catoid=32&coid=67600
To find the results of the query in order, construct a priority queue of WebPageIndex objects, one per web page in the urlListFile. The priority value should initially be computed by adding the counts of the words and phrases in the query. The priority queue can be used to print out the matching URLs in order.
To compare WebPageIndex objects, use a Comparator. Comparator<E> is a Java interface which contains the method

    int compare(E item1, E item2)

which returns a negative integer if item1 is less than item2, a positive integer if item1 is greater than item2, and zero if the two are considered equal.
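To make that contract concrete, here is a small hypothetical Comparator (this is not the provided StringComparator, just an illustration) that orders Strings by length:

```java
import java.util.Comparator;

// Illustrative only: orders Strings by length. compare() returns a
// negative number when a should come before b, zero when the two are
// equivalent, and a positive number otherwise.
public class StringLengthComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        return Integer.compare(a.length(), b.length()); // shorter strings first
    }
}
```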
The lab zip file contains a sample Comparator called StringComparator.java. You need to write your own Comparator to compare WebPageIndex objects, based on the current query.
What you need from the previous Lab:
There is also a jar file containing working versions of these classes that you may elect to use instead.
What you are given in the lab7.zip file:
% javac -classpath csci151lab6.jar:. ProcessQueries.java
% java -classpath csci151lab6.jar:. ProcessQueries urls-profs
What you need to write:
Write a heap-based implementation of a PriorityQueue, extending AbstractQueue. It should contain an array or ArrayList to hold the data items in the priority queue and a Comparator to compare the relevance of two web pages to a given query. In addition to the interface methods, it needs to contain at least one constructor, presumably one that takes a Comparator as input. Remember that this is supposed to be a min-heap, so the smallest value is at the top.
There is a skeleton MyPriorityQueue java file provided for you in the zip file, if you want to use it. It is not much more than what Eclipse would provide you, so it is up to you whether you start from scratch or from this file.
Don't forget to test each method you write, ideally as you write it with JUnit tests before moving on to the next part.
You will need the following public and private methods. In particular, after removing the minimum element, poll() should call a private percolateDown() method on the root after rearranging things. Use your Comparator field (here called cmp) whenever two items must be compared, for example when deciding which child to follow as the hole moves down through the heap.
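As a sketch, a min-heap priority queue along these lines might look like the following. All names besides the required AbstractQueue methods are illustrative, and your skeleton file may organize things differently:

```java
import java.util.AbstractQueue;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;

// One possible shape for MyPriorityQueue: a min-heap backed by an
// ArrayList, so the smallest element (per the Comparator) sits at index 0.
public class MyPriorityQueue<E> extends AbstractQueue<E> {
    private ArrayList<E> heap = new ArrayList<>();
    private Comparator<E> cmp;

    public MyPriorityQueue(Comparator<E> cmp) {
        this.cmp = cmp;
    }

    @Override
    public boolean offer(E item) {
        heap.add(item);                  // place the new item in the last slot
        percolateUp(heap.size() - 1);    // then restore the heap property
        return true;
    }

    @Override
    public E poll() {
        if (heap.isEmpty()) return null;
        E min = heap.get(0);
        E last = heap.remove(heap.size() - 1);
        if (!heap.isEmpty()) {
            heap.set(0, last);           // move the last leaf to the root
            percolateDown(0);            // and sift it back down
        }
        return min;
    }

    @Override
    public E peek() {
        return heap.isEmpty() ? null : heap.get(0);
    }

    @Override
    public int size() { return heap.size(); }

    @Override
    public Iterator<E> iterator() { return heap.iterator(); }

    private void percolateUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (cmp.compare(heap.get(i), heap.get(parent)) >= 0) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void percolateDown(int i) {
        int n = heap.size();
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;       // left child
            if (child + 1 < n && cmp.compare(heap.get(child + 1), heap.get(child)) < 0)
                child++;                 // pick the smaller of the two children
            if (cmp.compare(heap.get(child), heap.get(i)) >= 0) break;
            swap(i, child);
            i = child;
        }
    }

    private void swap(int i, int j) {
        E tmp = heap.get(i);
        heap.set(i, heap.get(j));
        heap.set(j, tmp);
    }
}
```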
Don't forget to write JUnit tests as you go along, to test your priority queue. Create MyPriorityQueueTest.java and work on the tests as you implement the various methods. You can check the behaviour of MyPriorityQueue with that of Java's PriorityQueue. Since your priority queue requires a Comparator in order to construct it, you may want to use the provided StringComparator, and make priority queues out of Strings.
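One way to structure such tests is to cross-check removal order against a sorted oracle. The sketch below uses java.util.PriorityQueue as the queue under test so that it runs on its own; in MyPriorityQueueTest you would build a MyPriorityQueue (for example with the provided StringComparator) instead, and wrap each check in a JUnit @Test method:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.PriorityQueue;
import java.util.Random;

// Cross-checking sketch: feed the same random strings to the queue under
// test and to a sorted-list oracle, then verify the removal order matches.
public class CrossCheck {
    public static boolean removalOrderMatchesSorted(long seed, int n) {
        Random rng = new Random(seed);
        PriorityQueue<String> pq = new PriorityQueue<>(String::compareTo);
        ArrayList<String> oracle = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String s = Integer.toString(rng.nextInt(1000));
            pq.offer(s);
            oracle.add(s);
        }
        Collections.sort(oracle);        // expected removal order for a min-heap
        for (String expected : oracle) {
            if (!expected.equals(pq.poll())) return false;
        }
        return pq.isEmpty();
    }
}
```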
In this section of the lab you will write a Comparator to compare WebPageIndex objects, based on the current query. Recall that Comparator<E> is a Java interface which contains the method

    int compare(E item1, E item2)

which returns a negative integer if item1 is less than item2, a positive integer if item1 is greater than item2, and zero if the two are considered equal.
Once you understand StringComparator, write your own comparator, URLComparator, that compares two WebPageIndex objects based on their relevance to a given query (indicated as a parameter to the constructor).
You should include a method (possibly a public one...) that allows you to compute a score for a given WebPageIndex object using the current query. To remind you, the score is the sum of the word counts of each word (and phrase) in the query.
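A sketch of that scoring method, assuming WebPageIndex exposes a per-word count lookup (getCount() is an assumed name here; the tiny stub class below only stands in for your real WebPageIndex so the example runs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Score sketch: the score is the sum of each query term's count in the page.
// getCount() is an assumed method name; use whatever your WebPageIndex
// actually provides.
public class ScoreDemo {
    static class WebPageIndexStub {
        Map<String, Integer> counts = new HashMap<>();
        int getCount(String word) { return counts.getOrDefault(word, 0); }
    }

    // Higher scores are better matches. Negating the score makes a min-heap
    // yield the best match first, which is one way to get the negative
    // priorities shown in the sample output.
    public static int score(WebPageIndexStub page, List<String> queryTerms) {
        int total = 0;
        for (String term : queryTerms) {
            total += page.getCount(term);
        }
        return total;
    }
}
```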
Test your Comparator with JUnit tests before proceeding!
Note: If your URLComparator class is not "recognizing" WebPageIndex, it is probably because you declared your URLComparator class incorrectly. You should use a declaration of the form public class URLComparator implements Comparator<WebPageIndex>.
This class contains the main method of the application. The program has two basic parts. The first part is to build a list of WebPageIndex objects from the URLs listed in the urlFileList. The second part is to enter a loop to process a series of user queries. Create a class ProcessQueries whose main method will have this functionality.
First, implement the part of your program that processes the urlListFile. For each URL read in, use that URL to construct a WebPageIndex. Put all of the WebPageIndex objects in a list. You should do this as a method and not just have all the code in main.
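One possible shape for that method is sketched below. To keep the example self-contained and offline, buildIndex() merely validates the URL string; in your program it would construct and return a WebPageIndex instead:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Loading sketch: read each URL string from the list file, try to build an
// index for it, and keep a fetched/error tally. In your program the list
// would hold WebPageIndex objects rather than bare URLs.
public class UrlLoader {
    public static List<URL> loadAll(String urlListFile) throws FileNotFoundException {
        List<URL> pages = new ArrayList<>();
        int errors = 0;
        Scanner in = new Scanner(new File(urlListFile));
        while (in.hasNext()) {
            String spec = in.next();
            try {
                pages.add(buildIndex(spec));
            } catch (Exception e) {      // bad URL or failed fetch: skip it
                errors++;
            }
        }
        in.close();                      // release the file descriptor
        System.out.println("Fetched: " + pages.size() + " Errors: " + errors
                + " out of " + (pages.size() + errors));
        return pages;
    }

    // Stand-in for new WebPageIndex(spec): only checks the URL is well formed.
    static URL buildIndex(String spec) throws MalformedURLException {
        return new URL(spec);
    }
}
```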
Second, you need to process the user queries. To find the results of the query in best-to-worst order, construct a priority queue of WebPageIndex objects, one per web page in the urlListFile. The priority value should be computed by adding the counts of the words and phrases in the query. The priority queue can be used to print out the matching URLs in order.
For every subsequent search query, use the new query to construct a comparator, and then reheap your collection of WebPageIndex objects using that comparator.
You should also support searching for phrases contained within double quotes. String objects have a number of methods like startsWith(), endsWith(), and substring() that I found to be useful when constructing a phrase. You can wait and add this in at the end once you have everything else working if you'd like.
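One way to parse such queries, as a sketch: read whitespace-separated tokens, and when a token starts with a double quote, keep appending tokens until the closing quote appears:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Query-parsing sketch: split a query into terms, gluing tokens between
// double quotes back together into a single phrase. startsWith(),
// endsWith(), and substring() do the work, as the handout suggests.
public class QueryParser {
    public static List<String> parse(String query) {
        List<String> terms = new ArrayList<>();
        Scanner tok = new Scanner(query);
        while (tok.hasNext()) {
            String word = tok.next();
            if (word.startsWith("\"")) {
                StringBuilder phrase = new StringBuilder(word.substring(1));
                // keep appending tokens until one ends with the closing quote
                while (!phrase.toString().endsWith("\"") && tok.hasNext()) {
                    phrase.append(" ").append(tok.next());
                }
                String p = phrase.toString();
                if (p.endsWith("\"")) p = p.substring(0, p.length() - 1);
                terms.add(p);
            } else {
                terms.add(word);
            }
        }
        tok.close();
        return terms;
    }
}
```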
The program then prints out the matching URLs in order from best to worst match until there are no matches or the user specified limit is reached. Continue reading and processing queries in a loop until you reach end of file on System.in or some designated terminator string (e.g., "-1").
Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WebPageIndex for all three words to determine the URL's priority.
Don't "drop" the results as you are pulling them out of the heap. Just stick them in a list of some sort as you remove them and add them back in afterwards.
When you change a comparator in the heap, you need to reheapify things. You could just create a new ArrayList and add in all the old items. A better choice would be to reheapify using the linear time algorithm discussed in class.
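A sketch of the linear-time build-heap idea: percolate down from the last internal node back to the root. This assumes the usual array layout where the children of index i live at 2i+1 and 2i+2:

```java
import java.util.Comparator;
import java.util.List;

// Reheapify sketch: after swapping in a comparator for the new query, the
// heap can be rebuilt in O(n) by percolating down every internal node,
// instead of re-offering all n items at O(n log n).
public class Heapify {
    public static <E> void buildHeap(List<E> heap, Comparator<E> cmp) {
        for (int i = heap.size() / 2 - 1; i >= 0; i--) {
            percolateDown(heap, cmp, i);
        }
    }

    private static <E> void percolateDown(List<E> heap, Comparator<E> cmp, int i) {
        int n = heap.size();
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && cmp.compare(heap.get(child + 1), heap.get(child)) < 0)
                child++;                 // smaller of the two children
            if (cmp.compare(heap.get(child), heap.get(i)) >= 0) break;
            E tmp = heap.get(i);
            heap.set(i, heap.get(child));
            heap.set(child, tmp);
            i = child;
        }
    }
}
```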
Your output should include each page's score / priority. We deliberately haven't said how to do this, and there are a couple of different approaches that will work. We hope you can find a way to solve this on your own, but if you are stuck on how to get those priorities, you can ask and we'll point you in the right direction.
You might run into a message that indicates that you have too many files open. Every HTMLScanner you create (or Scanner to read a file) uses one of a limited number of file descriptor slots. You should get rid of the reference to either the Scanner or the HTMLScanner when you are done using it. If you keep them in your WebPageIndex class, you will run out of descriptors to use.
You might also run out of memory on a large url-list file if you run Java in the default manner (it only allocates 64MB of RAM). You can increase the amount of memory given to the Java Virtual Machine with an option to the java command. For example,
java -Xmx1g ProcessQueries urls-all 10
where the -Xmx indicates that you want it to use more memory, and the 1g indicates to use up to 1 gigabyte.
When you are dealing with a large number of urls, you might want to print out a status message every N items just to let yourself know that things are still progressing. You can be extra fancy if you use "\r" in a System.out.print() statement. Try out the following which you can modify for your own purposes (be sure to take out the sleep):
public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < 10; i++) {
        Thread.sleep(500);
        System.out.print("\rCounting up to " + i);
    }
    System.out.println();
}
Once you have your program working correctly, you may have noticed that it takes a while for it to load all of the URLs at the start. There are a couple of ways to address this. One is to cache the results from your fetching -- you could do that by writing out your WebPageIndex objects to disk. However, another way would be to fetch multiple pages simultaneously.
The class WebPageLoader is designed to do just that. If you give it a list of URLs and a number, it will fetch that number of pages simultaneously. This class is still experimental, and I found that you might get more errors when using it than you do from fetching pages sequentially.
Once your ProcessQueries program is working correctly, you are welcome to try using this class to see if it speeds things up. You should comment out your existing method that creates all of the WebPageIndex objects (you did do that as a method, didn't you?) and add in a call to a new method that uses WebPageLoader to do the fetching. We may need to be able to test your program with the sequential fetching, so leave all of that code in, just call a different method.
You'll want to be careful to not set it up to make too many parallel requests. I found that 5-10 works well, but something like 20 often resulted in many more failed page loads than before.
Use the handin program to submit a directory containing
If you work with a partner, please only one of you submit your joint solution using handin.
Here are a few suggestions as to how you might improve your search engine.
If you try any of these, or come up with another technique, write something in your README file to describe what you did, how difficult it was, and how well (if at all) it improved your search results.
MyPriorityQueue      [/15]
MyPriorityQueueTest  [/5]
URLComparator        [/4]
URLComparatorTests   [/2]
ProcessQueries       [/20]
README               [/2]
Javadocs             [/2]
TOTAL:               [/50]