CSCI 151 - Lab 10
Everything is better with Bacon

Introduction

In class, we have been discussing how graph structures can be used to represent relationships between groups of objects. For this assignment, you will be writing a program that allows you to play the "Kevin Bacon Game". A person's "Bacon Number" is computed based on the number of movies of separation between that person and the actor Kevin Bacon. For example, if you are Kevin Bacon, then your Bacon Number is 0. If you were in a movie with Kevin Bacon, your number would be 1. If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2. In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.

For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.

Program Details

You will be writing a class called BaconNumber that will read a data file and allow you to interactively query the system for the Bacon Number and path for any actor in the database. The program should require a single argument which is the filename containing the information on people and the roles they played in a movie. An optional second argument can be used to specify the initial center. After reading in the data, the program should then prompt the user for commands until an end-of-file (CTRL-D) is reached (hasNextLine() will return false).
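
The prompt loop described above could be sketched roughly as follows. This is only an illustration: the class name CommandLoop and the helpers splitCommand and run are hypothetical, the "would run" messages stand in for the real command handlers, and the prompt text is borrowed from the sample topcenter output later in this handout.

```java
import java.io.PrintStream;
import java.util.Scanner;

final class CommandLoop {
    /** Splits "find James Earl Jones" into {"find", "James Earl Jones"}. */
    static String[] splitCommand(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return new String[] { trimmed, "" };   // command with no argument
        }
        return new String[] { trimmed.substring(0, space),
                              trimmed.substring(space + 1).trim() };
    }

    /** Prompts until hasNextLine() is false; returns how many known commands were read. */
    static int run(Scanner in, PrintStream out) {
        int handled = 0;
        out.print("Enter a command: ");
        while (in.hasNextLine()) {                 // false at end-of-file (CTRL-D)
            String[] cmd = splitCommand(in.nextLine());
            switch (cmd[0]) {
                case "find":
                case "recenter":
                case "avgdist":
                case "topcenter":
                case "table":
                    // Placeholder: a real program would call its handler here.
                    out.println("would run " + cmd[0] + " with argument \"" + cmd[1] + "\"");
                    handled++;
                    break;
                case "":
                    break;                         // ignore blank lines
                default:
                    out.println("unknown command: " + cmd[0]);
            }
            out.print("Enter a command: ");
        }
        out.println();
        return handled;
    }
}
```

A main method would typically call `CommandLoop.run(new Scanner(System.in), System.out)` after the data file has been loaded.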

Similar to what you did in past labs, if the filename argument begins with "http:" you should treat it as a URL and read the file from the network. This will enable you to play the game without having to download the entire file. To open a Scanner from a URL, you just need to do something similar to the following:
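
One possible way to choose between a local file and a network stream is sketched here. The class name InputSource and its two methods are hypothetical and not part of the assignment; only the "http:" prefix test comes from the description above.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.util.Scanner;

final class InputSource {
    /** Returns true when the argument should be read over the network. */
    static boolean isUrl(String source) {
        return source.startsWith("http:");
    }

    /** Opens a Scanner on a URL (names beginning with "http:") or on a local file. */
    static Scanner open(String source) throws IOException {
        if (isUrl(source)) {
            // openStream() connects to the URL and returns its contents as an InputStream.
            return new Scanner(URI.create(source).toURL().openStream());
        }
        return new Scanner(new File(source));
    }
}
```

The caller can then read lines with hasNextLine() and nextLine() without caring where the data came from.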

Sample command line usage

File Format

The movie data file contains information on what movies a performer appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:

The vertical pipe character '|' can be used to determine where the name ends and the title begins. There will only be one '|' on a line and there are no empty names or titles. java.lang.String has a number of methods that can be used to divide up the line. (e.g., split("\\|"))
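
As a rough illustration of the split approach, the sketch below breaks one line into a name and a title. The class CreditLine is hypothetical, and it assumes the name comes before the '|' and the title after it, as the paragraph above implies.

```java
final class CreditLine {
    final String name;   // performer, text before the '|'
    final String title;  // movie, text after the '|'

    CreditLine(String name, String title) {
        this.name = name;
        this.title = title;
    }

    /** Parses a "name|title" line; rejects lines that do not have exactly two parts. */
    static CreditLine parse(String line) {
        // '|' is a regex metacharacter, so it must be escaped for split().
        String[] parts = line.split("\\|");
        if (parts.length != 2) {
            throw new IllegalArgumentException("malformed line: " + line);
        }
        return new CreditLine(parts[0], parts[1]);
    }
}
```

For example, `CreditLine.parse("Kevin Bacon (I)|Animal House (1978)")` yields the name "Kevin Bacon (I)" and the title "Animal House (1978)".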

I have supplied several data files of varying sizes for you to work with. (Don't download them to your CS account, see below.)

Rather than cluttering up your account with these files, you can use the links above as URLs. Alternatively, once you have your lab folder created, you can run 151lab10setup from a lab machine to get symbolic links to the files in the current directory. Don't submit the imdb files when you hand in the assignment.

For anything other than the small database, you'll almost certainly need to increase the amount of memory available to Java via the -Xmx argument.

Commands to be supported

Your program should read in the specified file and, by default, choose "Kevin Bacon (I)" as the initial center. There are a number of commands you are to support in order to query the database and change the center.

find <name>

Find the shortest path from the current center to <name>. The output should be of the format
```
    <name1> -> <movie1> -> <name2> -> <movie2> -> ... -> Kevin Bacon (I) (n)
```
where <name1> is the person specified by the user and the movies and actors in between show the path from that actor to the current center. The '(n)' should indicate the Bacon Number. E.g., "find James Earl Jones" in the "full" database yields
```
    James Earl Jones -> Magic 7, The (2008) (TV) -> Kevin Bacon (I) (1)
    
```
and in the "no-tv-v" set:
```
    James Earl Jones -> Blood Tide (1982) -> Mary Louise Weller 
            -> Animal House (1978) -> Kevin Bacon (I) (2)
    
```
Note that your links may differ, but the path length should be the same.

If someone is disconnected from the center simply print
```
    <name> is unreachable
    
```
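
The find command can be built on a breadth-first search from the center. The sketch below is one way to do it, not the required design: the class BaconPaths and its methods are hypothetical, and it assumes actors and movies are stored as vertices in a single undirected adjacency map keyed by their strings (so a name and a title must never be identical).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BaconPaths {
    /** Adds an undirected actor-movie edge; both ends are ordinary vertices. */
    static void addCredit(Map<String, List<String>> graph, String actor, String movie) {
        graph.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        graph.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    /** BFS from center: maps each reached vertex to its neighbor one step closer to center. */
    static Map<String, String> parents(Map<String, List<String>> graph, String center) {
        Map<String, String> parent = new HashMap<>();
        parent.put(center, null);                 // the center has no parent
        Deque<String> queue = new ArrayDeque<>();
        queue.add(center);
        while (!queue.isEmpty()) {
            String v = queue.remove();
            for (String w : graph.getOrDefault(v, List.of())) {
                if (!parent.containsKey(w)) {     // first visit is along a shortest path
                    parent.put(w, v);
                    queue.add(w);
                }
            }
        }
        return parent;
    }

    /** Formats the path from name back to the center, or reports that it is unreachable. */
    static String find(Map<String, String> parent, String name) {
        if (!parent.containsKey(name)) {
            return name + " is unreachable";
        }
        StringBuilder path = new StringBuilder(name);
        int edges = 0;
        for (String v = parent.get(name); v != null; v = parent.get(v)) {
            path.append(" -> ").append(v);
            edges++;
        }
        // Each actor-to-actor step crosses two edges (actor -> movie -> actor).
        return path + " (" + edges / 2 + ")";
    }
}
```

Because the parent map is computed once per center, repeated find commands only need to walk the parent links; the map has to be rebuilt after a recenter.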
recenter <name>

Change the center to the given name if it exists in the database. If the name is not found, print an appropriate message and do not change the center.

avgdist

Calculate the average Bacon Number for the given center among all connected nodes. Your output should be the following
```
    <avg><tab><name><space>(<number reachable>,<number unreachable>)
    
```
The average should only be for the nodes reachable from the center. In the top250 database, I get the following
```
    3.5942556977039737  Kevin Bacon (I) (11803,663)
    
```
and in the "no-tv-v" set I get
```
    2.99402433463726    Kevin Bacon (I) (1833436,118796)
    
```
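
Once the Bacon Number of every reachable actor is known, the avgdist line is simple arithmetic. The sketch below assumes a hypothetical map from actor name to Bacon Number that includes the center itself at distance 0, plus the total number of actors in the database; the class AvgDist is illustrative only.

```java
import java.util.Map;

final class AvgDist {
    /** Mean Bacon Number over the reachable actors (the map must not be empty). */
    static double average(Map<String, Integer> dist) {
        long total = 0;
        for (int d : dist.values()) {
            total += d;
        }
        return (double) total / dist.size();
    }

    /** Builds "<avg><tab><name><space>(<number reachable>,<number unreachable>)". */
    static String report(String center, Map<String, Integer> dist, int totalActors) {
        int reachable = dist.size();
        int unreachable = totalActors - reachable;
        return average(dist) + "\t" + center + " (" + reachable + "," + unreachable + ")";
    }
}
```

Counting the center among the reachable actors matches the top250 figures in this handout, where the 11803 reachable actors include the single entry at distance 0.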
topcenter <n>

For each actor in the current connected component (i.e., the one containing the current center), calculate the average bacon distance to all actors in that component. (NOTE: this can take a very long time on larger data sets.) Then print a table of the n best centers (i.e., the ones whose average bacon distance is the smallest).

Calculate the average Bacon Number for all entries in the database. NOTE: this can take a very long time on larger data sets.

In the top 250 set, my program finds "Robert Duvall (11803,663)" is the best center (~2.699) and the worst center is "Kumeko Otowa (11803,663)" (~6.378).

Here's the output from running topcenter 5 on the top250 dataset:
```
Enter a command: topcenter 5 
2.6989748369058715  robert duvall
2.7369312886554265  harrison ford (i)
2.741930017792087   robert de niro
2.776666949080742   john ratzenberger
2.798017453189867   alec guinness
    
```
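
A straightforward (and slow) way to implement topcenter is to run one breadth-first search per candidate and sort the results. The sketch below assumes the same kind of combined actor/movie adjacency map as an undirected graph, plus a set saying which vertices are actors; the class TopCenter, its method names, and the tab between the two output columns are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopCenter {
    /** Average Bacon Number from center to every reachable actor, center included. */
    static double averageDistance(Map<String, List<String>> graph,
                                  Set<String> actors, String center) {
        Map<String, Integer> hops = new HashMap<>();
        hops.put(center, 0);
        Deque<String> queue = new ArrayDeque<>();
        queue.add(center);
        long total = 0;
        int reached = 0;
        while (!queue.isEmpty()) {
            String v = queue.remove();
            int h = hops.get(v);
            if (actors.contains(v)) {
                total += h / 2;        // two edges (actor -> movie -> actor) per step
                reached++;
            }
            for (String w : graph.getOrDefault(v, List.of())) {
                if (hops.putIfAbsent(w, h + 1) == null) {
                    queue.add(w);
                }
            }
        }
        return (double) total / reached;
    }

    /** Returns the n candidates with the smallest average, formatted one per row. */
    static List<String> topCenters(Map<String, List<String>> graph, Set<String> actors,
                                   Collection<String> candidates, int n) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (String a : candidates) {
            scored.add(Map.entry(a, averageDistance(graph, actors, a)));
        }
        scored.sort(Map.Entry.comparingByValue());   // smallest average first
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Double> e : scored.subList(0, Math.min(n, scored.size()))) {
            rows.add(e.getValue() + "\t" + e.getKey());
        }
        return rows;
    }
}
```

The cost is one full search per candidate, which is why this command takes so long on the larger data sets.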

table

Print a table of the counts of Bacon Numbers for the given center, from 0 up to the longest.

In the top250 database I get:

    Table of distances for Kevin Bacon (I)
    Number    0:           1
    Number    1:          87
    Number    2:         539
    Number    3:        4462
    Number    4:        5786
    Number    5:         840
    Number    6:          88
    Unreachable:         663

In the no-tv-v database I get:

      Table of distances for Kevin Bacon (I)
      Number	0:	1
      Number	1:	3344
      Number	2:	408925
      Number	3:	1425751
      Number	4:	349704
      Number	5:	30061
      Number	6:	3482
      Number	7:	380
      Number	8:	92
      Number	9:	12
      Unreachable: 	164815

and for the full database I get:

      Table of distances for Kevin Bacon (I)
      Number	0:	1
      Number	1:	5920
      Number	2:	646684
      Number	3:	1653925
      Number	4:	289613
      Number	5:	24138
      Number	6:	2738
      Number	7:	361
      Number	8:	64
      Number	9:	6
      Unreachable:	176859
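
The table is a histogram of the Bacon Numbers. The sketch below assumes a hypothetical map from reachable actor to Bacon Number and imitates the column widths of the top250 table; the class DistanceTable and the exact format strings are guesses, not a required layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DistanceTable {
    /** counts[d] is the number of actors whose Bacon Number is d. */
    static int[] counts(Map<String, Integer> dist) {
        int longest = 0;
        for (int d : dist.values()) {
            longest = Math.max(longest, d);
        }
        int[] counts = new int[longest + 1];
        for (int d : dist.values()) {
            counts[d]++;
        }
        return counts;
    }

    /** Formats the heading, one row per distance, and the unreachable row. */
    static List<String> format(String center, int[] counts, int unreachable) {
        List<String> lines = new ArrayList<>();
        lines.add("Table of distances for " + center);
        for (int d = 0; d < counts.length; d++) {
            lines.add(String.format("Number%5d:%12d", d, counts[d]));
        }
        lines.add(String.format("Unreachable:%12d", unreachable));
        return lines;
    }

    /** Average Bacon Number implied by a table of counts. */
    static double average(int[] counts) {
        long total = 0;
        long reachable = 0;
        for (int d = 0; d < counts.length; d++) {
            total += (long) d * counts[d];
            reachable += counts[d];
        }
        return (double) total / reachable;
    }
}
```

The table and avgdist should agree with each other: the top250 counts above work out to 42423 / 11803, about 3.594, which matches the avgdist output for that data set.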

Additional commands

You may opt to include additional commands for consideration towards extra credit. Document any additional commands you implement in the README file, explaining what each one does and how someone could use it.

Notes

The longest Bacon Number I found in the 'imdb.no-tv-v.txt' dataset for Kevin Bacon was 9 ("Andrea Parlato" and others). "Kevin Bacon (I)" has an average distance value of ~2.994 while "Sean Connery" has ~2.955 indicating that he is a better center than Kevin Bacon. The Oracle of Bacon has a top 1000 list of centers which could be used to search for better values.

Programming Tips

As we have been discussing graphs, it should be no surprise that a good way to represent these acting relationships is through a graph. There are a number of ways to do this. If you want to maintain a simple graph, you might make both movies and actors vertices, with the edges simply being the relationships between them.

While an undirected graph could be used, the resulting path length will be double the Bacon Number, so you would need to divide the path length by 2 or use weights of 0.5 for the edges. Another technique is to create a directed graph and weight the edges from actors to movies as 0 and the edges from movies to actors as 1. Then, using Dijkstra's algorithm, you can find the shortest path, with all actors and actresses listed for a movie considered equally.
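
The 0/1 weighting can be sketched as follows. This is a minimal illustration under assumed names (WeightedBacon, Edge, addCredit, dijkstra), not the book's Graph class: each credit adds an actor-to-movie edge of weight 0 and a movie-to-actor edge of weight 1, so the shortest-path distance to an actor is that actor's Bacon Number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class WeightedBacon {
    static final class Edge {
        final String to;
        final int weight;
        Edge(String to, int weight) { this.to = to; this.weight = weight; }
    }

    /** Adds actor -> movie with weight 0 and movie -> actor with weight 1. */
    static void addCredit(Map<String, List<Edge>> graph, String actor, String movie) {
        graph.computeIfAbsent(actor, k -> new ArrayList<>()).add(new Edge(movie, 0));
        graph.computeIfAbsent(movie, k -> new ArrayList<>()).add(new Edge(actor, 1));
    }

    /** Dijkstra's algorithm from source; vertices missing from the result are unreachable. */
    static Map<String, Integer> dijkstra(Map<String, List<Edge>> graph, String source) {
        Map<String, Integer> dist = new HashMap<>();
        PriorityQueue<Map.Entry<String, Integer>> pq = new PriorityQueue<Map.Entry<String, Integer>>(
                (a, b) -> Integer.compare(a.getValue(), b.getValue()));
        dist.put(source, 0);
        pq.add(Map.entry(source, 0));
        while (!pq.isEmpty()) {
            Map.Entry<String, Integer> top = pq.remove();
            String v = top.getKey();
            if (top.getValue() > dist.get(v)) {
                continue;                          // stale entry, a shorter path was found
            }
            for (Edge e : graph.getOrDefault(v, List.of())) {
                int candidate = top.getValue() + e.weight;
                Integer known = dist.get(e.to);
                if (known == null || candidate < known) {
                    dist.put(e.to, candidate);
                    pq.add(Map.entry(e.to, candidate));
                }
            }
        }
        return dist;
    }
}
```

With these weights a movie vertex gets the same distance as its closest cast member, so only the actor entries of the result should be reported or averaged.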

Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to include test cases in the main method.

If you decide to either use or model part of your implementation off of what is in the book, be sure to give proper credit in the methods or comments at the start of the file.

You can improve your results by appending a "(I)" to a name and retrying the operation if it isn't found in the database before giving up. (IMDB has been adding that to the end of a number of entries.)
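
The retry can be kept in one small lookup helper. The sketch below assumes the actor names are available as a set; the class NameLookup and the method resolve are hypothetical.

```java
import java.util.Optional;
import java.util.Set;

final class NameLookup {
    /** Returns the stored form of name, trying "name (I)" if the plain name is absent. */
    static Optional<String> resolve(Set<String> actors, String name) {
        if (actors.contains(name)) {
            return Optional.of(name);
        }
        String withSuffix = name + " (I)";
        if (actors.contains(withSuffix)) {
            return Optional.of(withSuffix);
        }
        return Optional.empty();   // caller prints its "not found" message
    }
}
```

Both find and recenter could pass the user's input through the same helper before touching the graph.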

What to Hand In

Acknowledgments

Information courtesy of The Internet Movie Database (http://www.imdb.com/). Used with permission. The data should only be used for personal and non-commercial purposes.
