Empirical Lab Repository

Title: Author Identification via Letter Frequencies

Author: Dave Reed, Creighton University, davereed@creighton.edu

Possible Courses: CS1

Empirical Concepts Introduced: data analysis

Computer Science Concepts Used: file I/O, arrays, counters, loops, character manipulation

Summary: This assignment involves analyzing patterns that appear in an author's works of literature. Authors have been shown to follow consistent patterns in their writing style (e.g., favoring certain words and letters over others), and researchers have used these patterns to identify the authors of uncredited works. In this assignment, students analyze works of literature with respect to letter frequencies. Each work is read from a file, one character at a time, and a count is maintained for each letter. The absolute and relative frequencies of each letter are then displayed in a table for analysis.

The program uses several common control structures and tools: file I/O for reading the literature text, an array of counters for maintaining the letter frequency counts, a loop for iterating over the characters, and character manipulation routines for treating letters case-insensitively and ignoring non-letters. Questions that the students are asked to consider at the end demonstrate that programs can be used to analyze the patterns in complex data.
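
The structure described above might be sketched as follows. This is a minimal illustration in Python, not the assignment's reference solution: the function names (count_letters, count_letters_in_file, print_table) and the table layout are assumptions made for the example, and only the 26 unaccented letters a-z are counted.

```python
def count_letters(chars):
    """Return 26 counters (index 0 = 'a', ..., 25 = 'z') for the given characters.

    Upper- and lowercase letters share a counter; non-letters are ignored.
    """
    counts = [0] * 26
    for ch in chars:
        # Only plain ASCII letters are counted (an assumption of this sketch).
        if ch.isascii() and ch.isalpha():
            counts[ord(ch.lower()) - ord('a')] += 1
    return counts


def count_letters_in_file(path):
    """Read a text file one character at a time and count its letters."""
    def chars():
        with open(path, encoding="utf-8") as f:
            while True:
                ch = f.read(1)  # returns '' at end of file
                if ch == "":
                    break
                yield ch
    return count_letters(chars())


def print_table(counts):
    """Print the absolute and relative frequency of each letter."""
    total = sum(counts)
    for i, n in enumerate(counts):
        percent = 100.0 * n / total if total else 0.0
        print(f"{chr(ord('a') + i)}  {n:8d}  {percent:6.2f}%")
```

A call such as print_table(count_letters_in_file("work.txt")) would display the table for one work; the file name here is hypothetical.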

Variations: To further emphasize the role of analysis, additional questions could be asked of the students. For example, the instructor might provide several works of literature by each of two different authors, then provide an uncredited work and ask the students to identify the author and justify their choice.
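
One way students might approach this variation is to compare the relative letter frequencies of the uncredited work against each author's known works. The sketch below is only an illustration: the assignment does not prescribe a comparison method, and the distance measure (the sum of absolute differences between relative frequencies) and the names relative_frequencies, profile_distance, and closest_author are assumptions made for the example.

```python
def relative_frequencies(text):
    """Return the relative frequency of each letter a-z in the text."""
    counts = [0] * 26
    for ch in text:
        # Case-insensitive; non-letters and non-ASCII characters are ignored.
        if ch.isascii() and ch.isalpha():
            counts[ord(ch.lower()) - ord('a')] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def profile_distance(p, q):
    """Sum of absolute differences between two frequency profiles.

    This measure is an assumption of the sketch; other measures could be used.
    """
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_texts):
    """Return the author whose combined known text is closest to the unknown text.

    known_texts maps an author name to the concatenated text of that
    author's known works.
    """
    unknown = relative_frequencies(unknown_text)
    return min(
        known_texts,
        key=lambda author: profile_distance(
            unknown, relative_frequencies(known_texts[author])
        ),
    )
```

A smaller distance only suggests a closer match in letter usage, so students would still need to justify their conclusion from the frequency tables themselves.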