CSC 321: Data Structures
Fall 2015

HW6: Text & Maps

No late submissions will be accepted for this assignment.


In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

0th-order approximation:
Characters are generated completely at random (chosen from the same alphabet as the source text).
1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: "th...", "tr...", and "th...", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'r' would be generated.
  .
  .
  .
kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c1...ck-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c1...ck-1" in the source text. For example, if k = 4 and "thr" appeared 5 times in the source text: "thro...", "thre...", "thro...", "thri", and "thro...", then there would be a 3/5 probability that 'o' would be generated next, a 1/5 probability that 'e' would be generated, and a 1/5 probability that 'i' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. The file name of the source text will be entered by the user, followed by repeated requests for approximations of specific lengths. You may choose to create a simple GUI to allow for this interaction, but a text-based interface (such as the one shown below, with user inputs in bold) is sufficient:

Enter the file name: lincoln.txt

Enter the order and the desired text length (-1 -1 to quit): 0 40
xhclpirjysqfdbdtrzcblciuysbvahznegxycsne

Enter the order and the desired text length (-1 -1 to quit): 1 40
autetaedo  haossraeoegrfr aawahyscrtpeci

Enter the order and the desired text length (-1 -1 to quit): 4 40
ich the larget war above dedicated it ca

Enter the order and the desired text length (-1 -1 to quit): 8 40
who died here it is rather for us the li

Enter the order and the desired text length (-1 -1 to quit): -1 -1
BYE

Note that a 0-th order approximation is completely random and so has no commonality with the source text. A 1st-order approximation is similar only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. Here, the 8th-order approximation is an entire sequence from the source text (the Gettysburg Address).

To simplify your task, the TextProcessor class has been provided for you. This class can be used to read in the contents of a file and store the contents in a single, processed string. For this assignment, we will only concern ourselves with letters and spaces, so all other characters are removed from the file contents. Letters are also made lowercase to simplify processing. You are to build on the TextProcessor class to construct a program that produces the above behavior.

Several points to consider: