CSC 221: HW 5

CSC 221: Introduction to Programming
Fall 2023

HW 5: Files and Lists

In HW4, you wrote Python code that processed a text file and produced a report on character types in the file. For this assignment, you will expand on that example to generate a comprehensive report on word and letter usage. In theory, such a literary fingerprint could provide insight into an author's methods and could also help to identify the author of anonymous or disputed works of literature. The report generated by your code must include:

the filename
the number of sentences in the file
the number of words in the file
the number of letters in the file
the average number of words in each sentence
the average number of letters in each word
the number of short words (between 1 and 3 letters) in the file, and the percentage of short words (with respect to the total number of words)
the number of long words (8 or more letters) in the file, and the percentage of long words (with respect to the total number of words)
the number of occurrences of each letter, ignoring cases, and the percentage of each letter (with respect to the total number of letters)
a histogram that shows the (rounded) percentages of each letter as a row of asterisks
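
As an illustration of the last two items, here is a minimal sketch of counting letters case-insensitively and turning one count into a histogram row. The function names letter_counts and histogram_row are hypothetical, and the sketch assumes that only the 26 English letters a-z are tallied; it is one possible approach, not the required design.

```python
import string

def letter_counts(text):
    """Return a dict mapping each lowercase letter a-z to its count in text.

    Assumption: only the 26 English letters are tallied; case is ignored.
    """
    counts = {letter: 0 for letter in string.ascii_lowercase}
    for ch in text.lower():
        if ch in counts:
            counts[ch] += 1
    return counts

def histogram_row(count, total):
    """Return one asterisk per (rounded) percentage point of total."""
    if total == 0:
        return ""          # no letters at all, so nothing to draw
    return "*" * round(100 * count / total)

counts = letter_counts("Abba!")
total = sum(counts.values())
print(counts["a"], histogram_row(counts["a"], total))
```

Note that Python's built-in round rounds exact halves to the nearest even integer, so a letter at exactly 2.5% would get 2 asterisks under this sketch.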

To make things easier, we will make some simplifying assumptions. We will assume that any sequence of characters, delimited by whitespace, that ends in a terminal punctuation mark (. ! ?) or a terminal punctuation mark followed by a quote (." !" ?" .' !' ?') marks the end of a sentence. This may lead to some counting errors, such as "I climbed Mt. Shasta." counting as two sentences. However, these counting errors may balance out, as sentences that end with brackets (e.g., [He whispered quietly.]) will not be counted. Likewise, we will assume that any sequence of characters, delimited by whitespace, that contains at least one letter is a word. For example, "2023" would not count as a word since it contains no letters, but "202x" would count as a word of length 1.
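
The two assumptions above could be expressed as tests on whitespace-delimited tokens. The sketch below is only one way to do it; the names ends_sentence and is_word are hypothetical, and it assumes that "quote" means a straight single or double quote character.

```python
TERMINALS = ".!?"
QUOTES = "\"'"      # assumption: straight quotes only

def ends_sentence(token):
    """True if token ends with . ! ? optionally followed by one quote."""
    if len(token) >= 2 and token[-1] in QUOTES:
        return token[-2] in TERMINALS
    return token != "" and token[-1] in TERMINALS

def is_word(token):
    """True if token contains at least one letter."""
    return any(ch.isalpha() for ch in token)

tokens = "I climbed Mt. Shasta.".split()
print(sum(ends_sentence(t) for t in tokens))   # counted as two sentences
print(sum(is_word(t) for t in tokens))
```

With these tests, the bracketed example [He whispered quietly.] contributes no sentence, because its last token ends in ] rather than in a terminal punctuation mark or quote.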

Due to these simplifying assumptions, it is possible for a file to have 0 sentences (e.g., if all punctuation marks are enclosed in brackets) or even 0 letters (e.g., if the file contains only numbers). Your code should not crash in extreme cases such as these but should either print a warning message or simply omit displaying any ill-defined statistics.
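
One way to handle these extreme cases is to check the denominator before dividing. The following is a sketch under the assumption that a warning message is the preferred behavior; safe_average and report_average are hypothetical helper names, and the wording of the warning is made up for illustration.

```python
def safe_average(total, count):
    """Return total / count rounded to one decimal place, or None if count is 0."""
    if count == 0:
        return None
    return round(total / count, 1)

def report_average(label, total, count):
    """Print the average, or a warning if it is ill-defined."""
    average = safe_average(total, count)
    if average is None:
        print(f"Warning: cannot compute {label} (nothing to average over)")
    else:
        print(f"{label}: {average}")

report_average("average words per sentence", 10, 4)
report_average("average words per sentence", 10, 0)   # warns instead of crashing
```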

Finally, note that you are asked to report the number of letters in the file, not the number of characters. You should ignore non-letters (e.g., whitespace, punctuation marks, digits) when calculating this total. Likewise, the average word length and the definitions of short and long words depend on the number of letters, not the number of characters. This ensures that punctuation marks do not count in word lengths (e.g., end! should count as a 3-letter word).
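
To make the distinction between letters and characters concrete, here is a short sketch that measures a token by its letters only and then classifies it as short or long. The function names are hypothetical, and str.isalpha is used as the letter test, which is an assumption about what counts as a letter.

```python
def letter_length(token):
    """Return the number of letters in token, ignoring all other characters."""
    return sum(1 for ch in token if ch.isalpha())

def is_short(token):
    """Short words have between 1 and 3 letters."""
    return 1 <= letter_length(token) <= 3

def is_long(token):
    """Long words have 8 or more letters."""
    return letter_length(token) >= 8

for token in ["end!", "202x", "Wonderland,"]:
    print(token, letter_length(token), is_short(token), is_long(token))
```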

Coding (95%)

Your main function, call it fingerprint, should take the filename as input and display all of the statistics described above. Since this function may become long, you may want to write some helper functions that are called by fingerprint to carry out supporting tasks (e.g., printing the letter frequency table). All averages and percentages should be rounded to one decimal place, and all letter frequency stats should be aligned in columns. For example, the execution on the left shows the fingerprint statistics for Lewis Carroll's Alice's Adventures in Wonderland. The execution on the right shows the fingerprint statistics for a short story by an unidentified author.
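
For the column alignment, Python's f-string format specifiers can pad each field to a fixed width. The layout below is a hypothetical example, not the layout of the sample executions; format_letter_row is an illustrative name, and the column widths are arbitrary choices.

```python
def format_letter_row(letter, count, total):
    """Return one aligned row of a letter frequency table.

    Assumption: the count is right-aligned in 8 columns and the
    percentage is shown with one decimal place in 5 columns.
    """
    percent = 100 * count / total if total > 0 else 0.0
    return f"{letter}: {count:>8} ({percent:5.1f}%)"

print(format_letter_row("a", 1, 3))
print(format_letter_row("e", 25, 200))
print(format_letter_row("z", 12345, 99999))
```

Because every field has a fixed width, the rows line up as long as the counts fit within the chosen number of columns.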

You should test your code on small files for which you can hand-calculate stats. Once you are confident it works as desired, you can test your code on the following public-domain texts:

Alice's Adventures in Wonderland, by Lewis Carroll
Bartleby, The Scrivener, by Herman Melville
The Pit and the Pendulum, by Edgar Allan Poe
The Tragedy of Hamlet, Prince of Denmark, by William Shakespeare
The Notorious Jumping Frog of Calaveras County, by Mark Twain

Analysis (5%)

It so happens that mystery.txt was written by one of the five authors listed above: Carroll, Melville, Poe, Shakespeare, or Twain. Compare the fingerprint statistics for this short story with the five known works and try to predict the unidentified author. In a separate document, include the fingerprint statistics for the author you have chosen, and provide a brief rationale for why you think this fingerprint best matches the mystery fingerprint.
