CSC 321: Data Structures
Fall 2016

HW5: Text & Maps


In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

0th-order approximation:
Characters are generated completely at random (chosen from the same alphabet as the source text).
1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: in "three" and "teeth", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'e' would be generated.
  .
  .
  .
kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c1...ck-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c1...ck-1" in the source text. For example, if k = 4 and "the" appeared 5 times in the source text: in "then", "the ", "weather", "heathen", and "lathe ", then there would be a 2/5 probability that 'n' would be generated next, a 2/5 probability that ' ' would be generated, and a 1/5 probability that 'r' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. Initially, the file name of the source text and the approximation order will be entered by the user. Then, repeated requests for approximations of specific lengths will follow, terminated by -1. For example:

Enter the file name and approximation order: lincoln.txt 4

Enter the desired text length (-1 to quit): 20
hat gove to that a l

Enter the desired text length (-1 to quit): 40
ng and dead will proposition this nation

Enter the desired text length (-1 to quit): 40
from the people nation that cannot peopl

Enter the desired text length (-1 to quit): -1
BYE

Note that a 0-th order approximation is completely random, but using only those characters that appear in the source text. A 1st-order approximation is similar only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. By the time the order reaches 8 or 10, it is likely that entire sections of the source text will be generated.

To simplify your task, the TextProcessor class has been provided for you. This class demonstrates how to read in the contents of a file and store the contents in a single, processed string. For this assignment, we will only concern ourselves with letters and spaces, so all other characters are removed from the file contents. Letters are also made lowercase to simplify processing. You are to build on the TextProcessor class to construct a program that produces the above behavior.

Several points to consider: