HW 2: Experimentation with Word Frequencies

Name: _________________________________________

CSC 221: Computer Programming I Spring 2008

HW 2: Experimentation with Word Frequencies

This assignment will demonstrate the use of computer programs as experimental tools for solving problems. Utilizing the SequenceGenerator class, you will perform experiments with word distributions and attempt to estimate the number of words with certain properties. Many of the assignments in this course will similarly emphasize the use of computers for modeling complex phenomena or for analyzing data, and so will help to develop empirical skills such as formulating reasonable hypotheses, designing experiments, executing them, and critiquing the results.

Generating Random Sequences

The SequenceGenerator class contains a method named randomSequence that generates a random sequence of letters. The method takes one parameter value, the length of the sequence to be generated, and returns a random letter sequence.

EXERCISE 1: Load this class into the BlueJ IDE and create a SequenceGenerator object (with no parameter). Generate 10 random sequences of 3 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be?

EXERCISE 2: Similarly, generate 10 random sequences of 4 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be? Would you expect a 4-letter word be more or less likely than a 3-letter word? Explain.

EXERCISE 3: Now, create a SequenceGenerator object which utilizes an alphabet consisting of one letter, say "X". What would you expect to happen if you generated random letter sequences using this one-letter alphabet? Verify your predictions using the new object.

EXERCISE 4: Generating one sequence at a time can make experimentation with a SequenceGenerator tedious. Add the following method to the source code for the SequenceGenerator class, placed below the existing methods (but above the closing curly-brace).
Compile the modified class, create a SequenceGenerator object, and call your new method. Verify that this method behaves as described.

EXERCISE 5: Next, modify the new displaySequences method by copy-and-pasting the System.out.println statement (all 5 lines of it) 20 times inside the method. This should result in 100 random sequences being printed every time the method is called. Verify that your modified method behaves as described.

Simple Experiments

Using a SequenceGenerator object, you can perform some interesting experiments. In particular, you can use the object as a tool for verifying or disproving hypotheses about word distributions, and to generate further data for analysis. First, consider the total number of unique 3-letter sequences that can be generated. Since each of the three positions in a sequence can be any of the 26 letters, there are 26³ = 17,576 different sequences. Clearly, not all of these sequences form real words. The question arises: how many random 3-letter sequences would you expect to have to generate before you obtain a word?

It so happens that there are approximately 500 3-letter words in the English language (according to a popular online dictionary). Thus, if you generated a random 3-letter sequence, there is a 500 out of 17,576 chance that it will be a word. Since 500/17,567 is approximately 1/35, you might expect 1 out of every 35 sequences to be a word. More accurately, you might expect 28 out of every 1,000 random 3-letter sequences to be words, since 1/35 is approximately 2.8%. This number can be verified experimentally using a SequenceGenerator object.

EXERCISE 6: Use a SequenceGenerator object to generate 1,000 random 3-letter sequences and count how many English words you obtain. List that number below. Hint: Generate the sequences in 10 groups of 100 by repeatedly calling displaySequences and clearing the output screen after each call. Scanning 100 sequences for words can be done in just a few seconds.

Is the number you obtained close to the expected value of 28? If not, generate another 1,000 sequences, average the word counts, and see if the approximation improves.

As the length of the letter sequences increases, the chances of generating a word at random decreases dramatically. For example, there are 1,777 4-letter words in the online dictionary. Thus, the chances of generating a 4-letter word at random are 1 in 257 (1,777/26⁴ = 1,777/456,976 = 1/257). For 5-letter sequences, the chances of generating a word at random is 1 in 4,920 (2,415/26⁵ = 2,415/11,881,376 = 1/4,920) .

EXERCISE 7: Use a SequenceGenerator object to generate 1,000 random 4-letter sequences and count how many English words you obtain. List that number below.

Since there is only a 0.38% chance of generating a 4-letter word at random, you would expect to obtain around 4 words out of 1,000 random 4-letter sequences. Is the number you obtained close to 4? Compared to the case for 3-letter words, would you expect it to take more or fewer sequences to obtain a number close to the expected value? Explain.

Problem Solving via Experimentation

Part of the blame for the scarcity of words among randomly generated sequences falls on letters such as 'q' and 'z'. Since these letters are used so infrequently in English, their inclusion in a random sequence of letters makes a real word extremely unlikely. If we exclude letters such as these, however, we can improve the chances of generating words considerably. For example, the 10 letters that appear most frequently in English text are "etaoinshrd". Random sequences of these letters would appear more likely to produce words.

EXERCISE 8: Create a SequenceGenerator object whose alphabet is limited to the letters "etaoinshrd". Generate 1,000 random 3-letter sequences of these letters and count how many English words you obtain. List that number below.

Generate another 1,000 random letter sequences and count the number of English words. Is this number relatively consistent with the number obtained from the first 1,000 sequences (e.g., within 10% of each other)? If not, calculate the average of these two counts and perform the experiment again. Continue this process until the count obtained from the last 1,000 sequences is close (within 10%) of the average from the previous sequences. Show your results at each stage in the process.

Using your experimental results from the previous exercise, you should now be able to estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". The followng general formula applies:

# of words = (# of possible sequences) * (chances that random sequence is word) The total number of 3-letter sequences that use only 10 letters is 10³ = 1,000. If you averaged 150 words out of 1,000 sequences, then 150/1,000 would serve as your estimate of the chances that a random sequence is a word. Plugging these numbers into the above formula, the total number of 3-letter words using only "etaoinshrd" is estimated to be (1,000 * 150/1,000) = 150.

EXERCISE 9: Using the numbers you obtained in EXERCISE 8, estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". Show your work in obtaining your estimate.

EXERCISE 10: Using the same approach as above, estimate the number of 4-letter words using only the letters "etaoinshrd". Show your work in obtaining your estimate.
Note: since there are many more 4-letter sequences than there are 3-letter sequences, it may take many more random generations in order to obtain a reasonable estimate. Justify your data.

Hand in a printout of your modified SequenceGenerator class along with these sheets.