CSC 321: Data Structures
Fall 2013

HW1: Tag Clouds


For this assignment, you will work in a two-person team (with partners assigned by the instructor) to complete two related programs. Each team member will be the lead for one of the programs, while providing support on the other program. These programs have many features in common, so working together and sharing ideas is strongly encouraged. You may not consult with classmates or persons outside your team (other than the instructor, of course). A single grade will be given for the pair of assignments, which will be shared equally by teammates.

The two programs, when executed in sequence, will enable you to generate a tag cloud from a specified text file. A tag cloud displays the most frequent words from a file, with the font size for each word varying with respect to its frequency. For example, the following tag cloud shows the ten most frequent words in the 2013 State of the Union address (ignoring common "junk" words like the and is).

Tag Cloud

The words are listed in alphabetical order, with the font size indicating relative frequency. For example, the word "applause" appeared most frequently, followed by "will" and "jobs."

Program 1: WordFrequencies.java

The first program will prompt the user for the name of a text file and process the words in that file. It should keep counts of all the words, and display the word counts in a separate file, ordered by frequency. The file junkWords.txt has been provided for you. It contains 107 common "junk" words, which should not be counted when processing the file. When counting words, capitalization is irrelevant and any characters other than letters and digits should be ignored. Thus, It's, its, and ITS!!! should all be considered equivalent to the base word its.

The input to your program will be the name of the text file to process. The output of your program will be directed to a text file with the same name but the extension out. Thus, if the user specifies the input file speech.txt, then the output file will automatically be named speech.out. The first line in the output file will consist of the name of the input file (in this example, speech.txt). Subsequent lines will contain a word and its corresponding frequency, separated by a single space. The words should be listed in order of decreasing frequency. That is, the first word listed should be the word that appears most frequently in the file, the second word listed should be the second most frequent word, etc. Words with the same frequency should be ordered alphabetically.

For example, if the input file short.txt contained the text on the left, then the output file short.out should contain the stats on the right. Note that the words this, is and a are not listed as they are ignored junk words.

short.txt short.out
This is a sentence.
This is a longer
sentence. This one is a
"much" LONGER sentence!
short.txt
sentence 3
longer 2
much 1
one 1

Program 2: TagCloud.java

The second program will take the stats file produced by the first program and generate a Web page that contains a tag cloud. The inputs to the program are the name of the stats file and the number of words to be displayed in the tag cloud. The output will be directed to a text file with the same name but the extension html. Thus, if the user specifies the input file speech.out, then the output file will automatically be named speech.html.

The content of the output will include HTML tags so that the resulting page can be displayed in a browser. You do not need to understand HTML in order to complete this portion - most of the tags in the page will be the same for all text files (see the example below). The file specific parts of the output will be the name of the input file and the sequence of words. The words in the tag cloud will be listed alphabetically and will differ in size based on their frequencies. The most frequent word will be displayed in a 100pt font, and other words will be proportional to their frequency. For example, a word that appears half as many times as the most frequent word will have font size 50pt. No word will be displayed in a font smaller than 8pt, however.

For example, if the user specified the stats file short.out (as shown above) and chose 3 as the number of words for the tag cloud, then the output file short.html below would be generated.

short.html
<!doctype html> <html> <head> <title>Tag Cloud</title> </head> <body> <h3>Tag Cloud for short.txt</h3> <table border=1><tr><td> <span style='font-size:66pt'>longer</span> &nbsp; <span style='font-size:33pt'>much</span> &nbsp; <span style='font-size:100pt'>sentence</span> &nbsp; </table> </body> </html>

Note that each word in the tag cloud has a corresponding line of the form

<span style='font-size:XXXpt'>YYY</span> &nbsp; where YYY is the word and XXX is its font size.

General Comments

While you should always strive to apply object-oriented design principles (e.g., high cohesion, loose coupling) to your programs, overall design structure is not the focus of this assignment. Instead, these programs are intended to reengage you with Java constructs from CSC222 and so a straightforward approach is acceptable. You may want to implement supporting classes to make the tasks easier, and are free to work from code examples from CSC222. You may not, however, search the Internet for code to adapt (other than small-scale examples of specific tasks, such as how to open an output file).

While overall design will not be considered in the grade for this assignment, basic elements of programming style will be part of the assessment. For example, you should choose meaningful variable names, indent consistently, and provide javadoc comments for each class and method. And, since you will be working as a team, you should clearly document who was the lead and who was the support person for each program. Your programs should also behave reasonably when given bad inputs. For example, if the user enters an input file name that does not exist, the program should not crash but should instead display a warning message.