Empirical Lab Repository

Title: Author Identification via Word Lengths

Author: Dave Reed, Creighton University, davereed@creighton.edu

Possible Courses: CS1

Empirical Concepts Introduced: data analysis

Computer Science Concepts Used: file I/O, arrays, counters, loops, string manipulation

Summary: This assignment involves analyzing patterns that may appear in an author's works of literature. It has been shown that authors tend to follow the same patterns in their writing style (e.g., favoring longer, more sophisticated words over short, simple words), and researchers have used these patterns to identify the authors of uncredited works. In this assignment, students will analyze works of literature with respect to word lengths. Each work is read from a file one word at a time, each word is stripped of all non-letters, and a count is maintained for each word length. The absolute and relative frequencies of each word length are then displayed in a table for analysis.

This program uses several common control structures and tools, including file I/O for reading the literature text, an array of counters for maintaining word-length frequency counts, a loop to iterate over the characters, and various string manipulation routines for removing non-letters. Questions that the students are asked to consider at the end demonstrate that programs can be used to analyze patterns in complex data.
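The processing described in the summary could be sketched roughly as follows. This is only an illustrative outline in Python, not the assignment's reference solution: the function names, the MAX_LENGTH cap of 15, and the choice to skip tokens that contain no letters are all assumptions made for the example.

```python
MAX_LENGTH = 15  # assumed cap; longer words are counted in the last slot


def words_in_file(filename):
    """Yield the whitespace-separated words of a text file, one at a time."""
    with open(filename, encoding="utf-8") as infile:
        for line in infile:
            for word in line.split():
                yield word


def strip_non_letters(word):
    """Return the word with every non-letter character removed."""
    return "".join(ch for ch in word if ch.isalpha())


def count_word_lengths(words, max_length=MAX_LENGTH):
    """Return a list in which counts[n] is the number of words of length n."""
    counts = [0] * (max_length + 1)  # index 0 is unused
    for word in words:
        letters = strip_non_letters(word)
        if letters:  # assumed: tokens such as "--" or "1842" are skipped
            counts[min(len(letters), max_length)] += 1
    return counts


def print_table(counts):
    """Display the absolute and relative frequency of each word length."""
    total = sum(counts)
    print(f"{'length':>6} {'count':>8} {'percent':>8}")
    for length in range(1, len(counts)):
        percent = 100.0 * counts[length] / total if total else 0.0
        print(f"{length:>6} {counts[length]:>8} {percent:>8.1f}")


def analyze_file(filename):
    """Count word lengths in the named file and print the frequency table."""
    print_table(count_word_lengths(words_in_file(filename)))
```

Calling analyze_file with the name of a text file (the file name is up to the instructor) prints one row per word length. Separating the counting from the file reading keeps count_word_lengths easy to check on a short list of words before running it on a full work.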

Variations: To further emphasize the role of analysis, additional questions could be asked of the students. For example, the instructor might provide several works of literature by two different authors, then provide an uncredited work and ask the students to identify its author and justify their choice.
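For the author-identification variation, students who want to back up their justification with a number could compare relative frequency distributions directly. The sketch below is one possible approach, not something the assignment prescribes: it assumes each work has already been reduced to a list of word-length counts, and it uses the sum of absolute differences between relative frequencies as the measure of dissimilarity. The function names and the choice of measure are assumptions for illustration.

```python
def relative_frequencies(counts):
    """Convert a list of counts into fractions of the total."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def distribution_distance(counts_a, counts_b):
    """Sum of absolute differences between two relative frequency lists.

    Assumed measure: 0.0 means identical distributions, 2.0 means the two
    works share no word lengths at all. Both lists must be the same length.
    """
    freqs_a = relative_frequencies(counts_a)
    freqs_b = relative_frequencies(counts_b)
    return sum(abs(a - b) for a, b in zip(freqs_a, freqs_b))


def closest_author(unknown_counts, known_counts_by_author):
    """Return the author whose counts are nearest to the unknown work's."""
    return min(
        known_counts_by_author,
        key=lambda author: distribution_distance(
            unknown_counts, known_counts_by_author[author]
        ),
    )
```

A small distance is only suggestive. Students should still inspect the frequency tables themselves and explain which word lengths drive the difference.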