Name: _________________________________________
Historians have long used personal letters, journals, and other private writings of politicians and influential leaders to study their actions and motivations. In our age of social media, this has been expanded to include social media uploads such as tweets, Facebook messages and Instagram posts. President Trump has been a prolific social media user and often uses his Twitter feed to speak directly to the people. Years from now, historians will study his tweets to gain insights into his leadership style and motivations. Data scientists use advanced software tools (e.g., SAS and SPSS) and programming languages (e.g., R and Python) to analyze data sets such as this. For this lab, however, you will use a simple and widely-available tool, Microsoft Excel.
The comma-separated value (CSV) file TrumpTweets.csv contains every tweet (42,462 of them) posted by Donald Trump, up through 11/10/19. This is a large file, requiring 6.4 MB of storage. If you tried to print the file out in a readable format, it would fill more than 2,700 pages! Each line in the TrumpTweets.csv file provides the following information about each tweet: the text of that tweet, the time that it was sent (in Greenwich Mean Time), the number of times that tweet was retweeted, and the number of times that tweet was favorited. For example, the two most-recent tweets are listed below (along with the column headers):
The raw data for this file was scraped from the Web site www.trumptwitterarchive.com. As is often the case when dealing with large data sets, the raw data had to be filtered to remove irrelevant information and to repair errors in the data (e.g., missing fields, duplicate entries, unreadable characters or emojis, etc.). The Twitter data from the Web site was filtered, cleaned and organized so that the tweets are in reverse chronological order, with the most recent tweet at the top of the file.
Download a copy of TrumpTweets.csv and store it on your computer. Then, open that file in Microsoft Excel. Since this is a CSV file, and not a full-featured Excel spreadsheet, formatting is not saved in the file. In particular, the columns are shown with minimal width, and so their contents may not be fully visible. You can resize the columns by clicking-and-dragging on the lines between columns. Also, you can format the tweet column so that the tweets wrap across lines by clicking on the A column header, then selecting Format --> Cells --> Wrap text.
Since the tweets are ordered by date, there are a number of questions that you can answer just by looking at the file. For example, suppose you wanted to know how many tweets Donald Trump posted on his election day (November 8, 2016). By scrolling through the file, we find that the first tweet on that date is in row 12,523 and the last tweet on that date is at 12,507. That means that there were 12,523 - 12,507 + 1 = 17 tweets on that day.
QUESTION 1: Answer the following questions by scrolling through the file and noting the dates and line numbers of tweets:
- On what date was Trump's first tweet?
- How many tweets has Trump posted so far in 2019?
- Did Trump's tweet production increase as President in 2017 compared to 2016, when he was mostly campaigning? List the number of tweets for these two years.
Excel has a number of built-in features that can be used to search and analyze entries. For instance, the Find feature (which can be accessed using the magnifying glass box in the upper right or else through the Edit --> Find menu) can find the location of a word or phrase within the spreadsheet. As an example, suppose we wanted to find the date of Trump's famous "covfefe" tweet. By entering the word "covfefe" in the Find box and hitting return, the first entry containing that word is found (on May 31, 2017). Hitting the right-facing triangle will find subsequent entries that match the word - it turns out there are two tweets on that day that contain "covfefe" and those are the only mentions.
QUESTION 2: Answer the following questions using the Find feature in Excel:
- How many times is Nebraska mentioned in the tweets?
- How many times is Creighton mentioned in the tweets?
Tasks such as finding the earliest tweet or counting the number of tweets on a given day are made simple because the tweets are ordered chronologically. Other questions, such as which tweet was retweeted the most, are not so simple since they require looking at every entry in the "# RETWEETS" column and finding the maximum. If we could reorder the tweets by the number of retweets, finding the most popular would be trivial.
Excel provides a tool for sorting data based on any column (or combination of columns). First, you need to select the entire contents of the spreadsheet by typing Command-A on a Mac (or Control-A under Windows). This highlights all of the columns. Then, select Data --> Sort from the top menu, and pick "# RETWEETS" as the column to sort by and "Largest to Smallest" as the order. This will reorder the tweets so that the "# RETWEETS" column is in descending order. The most popular tweet (with 369,530 retweets) is at the top.
QUESTION 3: Sort the file contents based on the number of retweets (in descending order) and answer the following questions:
- What are the three most popular tweets (in terms of numbers of retweets)? You don't need to provide the text of the tweets, just the numbers of retweets for each.
- How many of the tweets had no retweets at all?
- How popular was the famous "Despite the constant negative press covfefe" tweet (in terms of number of retweets)?
- Where does the "covfefe" tweet rank in retweet popularity (e.g., if it was in row 1000 after sorting, that would make it the 999th most popular tweet).
QUESTION 4: Similarly, sort the file contents based on the number of favorites (in descending order) and answer the following questions:
- What are the three most popular tweets (in terms of numbers of favorites)? You don't need to provide the text of the tweets, just the numbers of favorites for each.
- How many of the tweets had no favorites at all?
- How popular was the famous "Despite the constant negative press covfefe" tweet (in terms of number of favorites)?
- Where does the "covfefe" tweet rank in favorite popularity (e.g., if it was in row 1000 after sorting, that would make it the 999th most popular tweet).
Similar to how JavaScript provides useful functions like Math.pow
and Math.round
, Excel provides built-in functions that can be placed in cells to perform useful tasks. For example, the SUM function will calculate the sum of all of the numbers in some range of cells. The following formula, when placed in a cell within the spreadsheet, will display the total number of retweets for all of the tweets:
Similarly, the AVERAGE function calculates the average of the numbers in a range, e.g.,
Note that formulas involving functions or other math operations begin with an equals sign.
QUESTION 5: Use the SUM and AVERAGE functions to answer the following questions:
- What is the total number of retweets?
- What is the average number of retweets?
- Similarly, what is the total number of favorites?
- What is the average number of favorites?
Another function that is useful in many contexts is COUNTIF. The COUNTIF function takes a range of cells and a string as inputs, and will count the number of cells in that range that have that string as their contents. For example,
would calculate the number of 0 entries in the "# RETWEETS" column. Note that the contents of the cell must match the string in its entirety. If you want to count cells in which the string appears as a part of the contents, you can use the wildcard character "*". Placing a "*" before or after text in a string denotes that any characters can match in its place. Thus, the expression "0*" matches any string that starts with a "0", "*0" matches any string that ends with a "0", and "*0*" matches any string that contains a "0". Wild cards can be useful when you want to count tweets that contain a word or phrase. For example:
would count the number of tweets that contain the word "Nebraska" (regardless of where the word appears in the tweet).
QUESTION 6: Use the COUNTIF function to answer the following questions:
- Use the COUNTIF function to count the number of tweets that contain "Nebraska?" Does this number match the count you obtained earlier using Find?
- How many tweets contain at least one exclamation point?
- What percentage of tweets contain at least one exclamation point? (Recall: there are 42,462 tweets in the file.)
- Tweets that are retweets of a previous post begin with "RT". How many tweets posted by Trump were retweets?
- What percentage of tweets were retweets?
- Identify three non-trivial words or phrases (excluding words like "the" and "and") that appear in more than 2,000 tweets.
The COUNTIF function is useful for answering simple questions about the data. More complex questions, however, may require a multi-step approach. For example, suppose we wanted to know how many of the tweets were short (say < 20 characters) and how many were long (> 100 characters). To answer these questions, we first need to determine lengths of the individual tweets. Then we could sort the tweets based on length, or use functions such as COUNTIF and MAX to answer questions concerning length.
Calculating the length of the first tweet is fairly straightforward. Go to cell E2 and enter the following:
The LEN function will calculate and display the length of the tweet in cell A2. You could then enter a similar formula for the remaining 42,461 tweets (!), or thankfully use a shortcut to automate the process. If you select cell E2 then double-click on the square at the bottom right of that cell, the formula will be copied throughout the entire column. You can then determine the length of the longest tweet and count the number short and long tweets.
QUESTION 7: Generate the column of text lengths and answer the following questions:
- What is the length of the longest tweet?
- What is the length of the shortest tweet?
- How many tweets consist of fewer than 20 characters?
- How many tweets consist of more than 100 characters?
Suppose we wanted to know if there is a pattern to the tweets in terms of days of the week. For instance, does Trump tweet more often on the weekend, or midweek? In order to determine this, we would need to be able to extract the day of the week from the tweet dates, then count the number of tweets for each day. Similar to the column of tweet lengths that we created above, we will need to extract useful data about the tweets and store them in new columns.
QUESTION 8: Once the G column contains the day of the week for each tweet, you can use the COUNTIF function to count the number of tweets for each day:
- For each day of the week, how many tweets were posted on that day? Which day has the most tweets?
EXTRA CREDIT: Determine what hour of the day that Trump tweets the most. In order to do this, you will need to adjust for the Eastern time zone (as above) and extract the hour from that date using the HOUR function. Then, you can count the number of tweets for each hour of the day (0-23).