Author Identification via Letter Frequencies


Did Charles Dickens have a penchant for using words with the letter 'W'? Did Louisa May Alcott find words with the letter 'Q' simply irresistible? And did Virginia Woolf really avoid words with 'Z' like the plague? If these questions seem a bit fanciful, one should at least consider the case of the French author who purportedly wrote an entire novel (recently translated to English) avoiding the letter 'E'!

If these and other authors do favor words with certain letters over others, this information might be useful in identifying the authors of anonymous works of literature (or works whose authorship is disputed). For this assignment, you will write a C++ program that counts how often each letter occurs in a text. In particular, your program will read in a text file, count the number of times each letter appears in the text, and write out the count and relative frequency (as a percentage) of each letter. In order to perform this analysis, your code will need to do the following:

Your program should display the letter count and relative frequency of each letter, as in the sample execution below. Note that the letter statistics are displayed in two columns, and the letter counts and relative frequencies are aligned down the columns.

Enter the name of the file to be analyzed: sample.txt
A:    32 ( 4.6 %)    N:    71 (10.1 %)
B:     0 ( 0.0 %)    O:    50 ( 7.1 %)
C:    48 ( 6.8 %)    P:     7 ( 1.0 %)
D:    31 ( 4.4 %)    Q:     8 ( 1.1 %)
E:    95 (13.5 %)    R:    51 ( 7.3 %)
F:    49 ( 7.0 %)    S:    24 ( 3.4 %)
G:     2 ( 0.3 %)    T:    54 ( 7.7 %)
H:    21 ( 3.0 %)    U:    29 ( 4.1 %)
I:    60 ( 8.5 %)    V:     1 ( 0.1 %)
J:     0 ( 0.0 %)    W:     6 ( 0.9 %)
K:     0 ( 0.0 %)    X:     3 ( 0.4 %)
L:    49 ( 7.0 %)    Y:     1 ( 0.1 %)
M:    10 ( 1.4 %)    Z:     1 ( 0.1 %)
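
One way to get the aligned, two-column layout described above is to use the <iomanip> manipulators setw, fixed, and setprecision. The sketch below is only an illustration: the function names formatEntry and formatTable, the field widths (6 for the count, 4 for the percentage), and the four-space gap between columns are assumptions read off the sample execution, not requirements of the assignment.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds one entry such as "A:    32 ( 4.6 %)".
// The widths are assumptions inferred from the sample execution.
std::string formatEntry(char letter, int count, int total) {
    // Guard against an empty text so we never divide by zero.
    double percent = (total > 0) ? 100.0 * count / total : 0.0;
    std::ostringstream out;
    out << letter << ':' << std::setw(6) << count
        << " (" << std::fixed << std::setprecision(1)
        << std::setw(4) << percent << " %)";
    return out.str();
}

// Hypothetical helper: lays out A-M in the left column and N-Z in the right.
std::string formatTable(const int counts[26]) {
    int total = 0;
    for (int i = 0; i < 26; ++i) {
        total += counts[i];
    }
    std::ostringstream out;
    for (int row = 0; row < 13; ++row) {
        char left = static_cast<char>('A' + row);
        char right = static_cast<char>('A' + row + 13);
        out << formatEntry(left, counts[row], total) << "    "
            << formatEntry(right, counts[row + 13], total) << '\n';
    }
    return out.str();
}
```

Writing the result with cout << formatTable(counts); prints 13 rows, each holding one letter from the first half of the alphabet and one from the second half.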

Hint 1: You will need to store 26 different counts, one for each letter. Use an array of ints, similar to the array used for the dice statistics example, to store all of the counts.
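
As a rough sketch of Hint 1, the code below keeps the 26 counts in an array of ints, where counts[0] holds the tally for 'A' and counts[25] the tally for 'Z'. The names NUM_LETTERS, countLetters, and countLettersInText are made up for illustration, and the code assumes an ASCII-style encoding in which the letters are stored contiguously.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

const int NUM_LETTERS = 26;

// Hypothetical helper: adds every letter read from `in` to counts[0..25].
// The caller is expected to have set all counts to zero beforehand.
void countLetters(std::istream& in, int counts[NUM_LETTERS]) {
    char ch;
    while (in.get(ch)) {  // get() reads every character, including whitespace
        unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalpha(uc)) {
            int index = std::toupper(uc) - 'A';
            // Skip any letters that fall outside A-Z in other locales.
            if (index >= 0 && index < NUM_LETTERS) {
                ++counts[index];
            }
        }
    }
}

// Convenience wrapper (also an assumption) for trying the function on a string.
void countLettersInText(const std::string& text, int counts[NUM_LETTERS]) {
    std::istringstream in(text);
    countLetters(in, counts);
}
```

Because countLetters accepts any input stream, the same function should work on a file: declare int counts[NUM_LETTERS] = {0};, open an ifstream on the file name the user entered, and pass both to countLetters.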

Hint 2: Since C++ treats the char type as a special kind of integer, standard arithmetic operations can be applied to characters. In particular, when you subtract two letters, the result is the relative difference between the two letters. For example, the expressions ('C'-'A') and ('p'-'n') would both evaluate to 2 since the letters in each expression are two positions apart. In general, you can determine the position of any letter in the alphabet by subtracting 'A' or 'a' (depending on case).
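
The character arithmetic in Hint 2 can be tried out with a small sketch like the following. The function names letterPosition and letterAt are hypothetical, and the code assumes an encoding such as ASCII in which 'A' through 'Z' (and 'a' through 'z') are stored consecutively.

```cpp
// Hypothetical helper: returns the 0-based position of a letter in the
// alphabet ('A' or 'a' gives 0, 'Z' or 'z' gives 25), or -1 for non-letters.
int letterPosition(char ch) {
    if (ch >= 'A' && ch <= 'Z') {
        return ch - 'A';  // distance from 'A'
    }
    if (ch >= 'a' && ch <= 'z') {
        return ch - 'a';  // distance from 'a'
    }
    return -1;
}

// Hypothetical inverse: turns a position 0..25 back into an upper-case letter.
char letterAt(int position) {
    return static_cast<char>('A' + position);
}
```

The position returned by letterPosition can serve directly as an index into the array of counts, and letterAt recovers the letter for a given index when it is time to print the results.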

Hint 3: The <cctype> library contains numerous useful routines for testing and manipulating characters. The isalpha function takes a character as its argument and returns a nonzero (true) value if that character is alphabetic. The toupper function takes a character as its argument and returns that character converted to upper case, while tolower likewise converts a character to lower case. Non-letters are returned unchanged by both conversion functions.
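
To see the <cctype> routines from Hint 3 in action, consider the sketch below, which keeps only the letters of a string and converts them to upper case. The function name lettersOnlyUpper is invented for this example. The cast to unsigned char is a common precaution, since passing a negative char value to isalpha or toupper is undefined behavior.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the letters of `text`, in order, in upper case.
// Digits, punctuation, and whitespace are dropped.
std::string lettersOnlyUpper(const std::string& text) {
    std::string result;
    for (char ch : text) {
        unsigned char uc = static_cast<unsigned char>(ch);
        if (std::isalpha(uc)) {
            // toupper returns an int, so convert back to char before appending.
            result += static_cast<char>(std::toupper(uc));
        }
    }
    return result;
}
```

Converting every letter to a single case before counting means that 'e' and 'E' contribute to the same tally.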

For testing purposes, you may download the following public-domain texts: