Name: _________________________________________


CSC 121: Computers and Scientific Thinking
Fall 2019

Lab 4: Twitter Analytics

Historians have long used personal letters, journals, and other private writings of politicians and influential leaders to study their actions and motivations. In our age of social media, this has been expanded to include social media uploads such as tweets, Facebook messages and Instagram posts. President Trump has been a prolific social media user and often uses his Twitter feed to speak directly to the people. Years from now, historians will study his tweets to gain insights into his leadership style and motivations. Data scientists use advanced software tools (e.g., SAS and SPSS) and programming languages (e.g., R and Python) to analyze data sets such as this. For this lab, however, you will use a simple and widely-available tool, Microsoft Excel.

The comma-separated value (CSV) file TrumpTweets.csv contains every tweet (42,462 of them) posted by Donald Trump, up through 11/10/19. This is a large file, requiring 6.4 MB of storage. If you tried to print the file out in a readable format, it would fill more than 2,700 pages! Each line in the TrumpTweets.csv file provides the following information about each tweet: the text of that tweet, the time that it was sent (in Greenwich Mean Time), the number of times that tweet was retweeted, and the number of times that tweet was favorited. For example, the two most-recent tweets are listed below (along with the column headers):

Trump Tweets

The raw data for this file was scraped from the Web site www.trumptwitterarchive.com. As is often the case when dealing with large data sets, the raw data had to be filtered to remove irrelevant information and to repair errors in the data (e.g., missing fields, duplicate entries, unreadable characters or emojis, etc.). The Twitter data from the Web site was filtered, cleaned and organized so that the tweets are in reverse chronological order, with the most recent tweet at the top of the file.



Passive Questions

Download a copy of TrumpTweets.csv and store it on your computer. Then, open that file in Microsoft Excel. Since this is a CSV file, and not a full-featured Excel spreadsheet, formatting is not saved in the file. In particular, the columns are shown with minimal width, and so their contents may not be fully visible. You can resize the columns by clicking-and-dragging on the lines between columns. Also, you can format the tweet column so that the tweets wrap across lines by clicking on the A column header, then selecting Format --> Cells --> Wrap text.

Since the tweets are ordered by date, there are a number of questions that you can answer just by looking at the file. For example, suppose you wanted to know how many tweets Donald Trump posted on his election day (November 8, 2016). By scrolling through the file, we find that the first tweet on that date is in row 12,523 and the last tweet on that date is at 12,507. That means that there were 12,523 - 12,507 + 1 = 17 tweets on that day.

QUESTION 1:    Answer the following questions by scrolling through the file and noting the dates and line numbers of tweets:

Excel has a number of built-in features that can be used to search and analyze entries. For instance, the Find feature (which can be accessed using the magnifying glass box in the upper right or else through the Edit --> Find menu) can find the location of a word or phrase within the spreadsheet. As an example, suppose we wanted to find the date of Trump's famous "covfefe" tweet. By entering the word "covfefe" in the Find box and hitting return, the first entry containing that word is found (on May 31, 2017). Hitting the right-facing triangle will find subsequent entries that match the word - it turns out there are two tweets on that day that contain "covfefe" and those are the only mentions.

QUESTION 2:    Answer the following questions using the Find feature in Excel:



Sorting-based Questions

Tasks such as finding the earliest tweet or counting the number of tweets on a given day are made simple because the tweets are ordered chronologically. Other questions, such as which tweet was retweeted the most, are not so simple since they require looking at every entry in the "# RETWEETS" column and finding the maximum. If we could reorder the tweets by the number of retweets, finding the most popular would be trivial.

Excel provides a tool for sorting data based on any column (or combination of columns). First, you need to select the entire contents of the spreadsheet by typing Command-A on a Mac (or Control-A under Windows). This highlights all of the columns. Then, select Data --> Sort from the top menu, and pick "# RETWEETS" as the column to sort by and "Largest to Smallest" as the order. This will reorder the tweets so that the "# RETWEETS" column is in descending order. The most popular tweet (with 369,530 retweets) is at the top.

QUESTION 3:    Sort the file contents based on the number of retweets (in descending order) and answer the following questions:




QUESTION 4:    Similarly, sort the file contents based on the number of favorites (in descending order) and answer the following questions:





Function-based Questions

Similar to how JavaScript provides useful functions like Math.pow and Math.round, Excel provides built-in functions that can be placed in cells to perform useful tasks. For example, the SUM function will calculate the sum of all of the numbers in some range of cells. The following formula, when placed in a cell within the spreadsheet, will display the total number of retweets for all of the tweets:

=sum(c2:c42463)

Similarly, the AVERAGE function calculates the average of the numbers in a range, e.g.,

=average(c2:c42463)

Note that formulas involving functions or other math operations begin with an equals sign.


QUESTION 5:    Use the SUM and AVERAGE functions to answer the following questions:


Another function that is useful in many contexts is COUNTIF. The COUNTIF function takes a range of cells and a string as inputs, and will count the number of cells in that range that have that string as their contents. For example,

=countif(c2:c42463, "0")

would calculate the number of 0 entries in the "# RETWEETS" column. Note that the contents of the cell must match the string in its entirety. If you want to count cells in which the string appears as a part of the contents, you can use the wildcard character "*". Placing a "*" before or after text in a string denotes that any characters can match in its place. Thus, the expression "0*" matches any string that starts with a "0", "*0" matches any string that ends with a "0", and "*0*" matches any string that contains a "0". Wild cards can be useful when you want to count tweets that contain a word or phrase. For example:

=countif(a2:a42360, "*Nebraska*")

would count the number of tweets that contain the word "Nebraska" (regardless of where the word appears in the tweet).





QUESTION 6:    Use the COUNTIF function to answer the following questions:



Multi-step Questions

The COUNTIF function is useful for answering simple questions about the data. More complex questions, however, may require a multi-step approach. For example, suppose we wanted to know how many of the tweets were short (say < 20 characters) and how many were long (> 100 characters). To answer these questions, we first need to determine lengths of the individual tweets. Then we could sort the tweets based on length, or use functions such as COUNTIF and MAX to answer questions concerning length.

Calculating the length of the first tweet is fairly straightforward. Go to cell E2 and enter the following:

=len(a2)

The LEN function will calculate and display the length of the tweet in cell A2. You could then enter a similar formula for the remaining 42,461 tweets (!), or thankfully use a shortcut to automate the process. If you select cell E2 then double-click on the square at the bottom right of that cell, the formula will be copied throughout the entire column. You can then determine the length of the longest tweet and count the number short and long tweets.

QUESTION 7:    Generate the column of text lengths and answer the following questions:



Suppose we wanted to know if there is a pattern to the tweets in terms of days of the week. For instance, does Trump tweet more often on the weekend, or midweek? In order to determine this, we would need to be able to extract the day of the week from the tweet dates, then count the number of tweets for each day. Similar to the column of tweet lengths that we created above, we will need to extract useful data about the tweets and store them in new columns.

  1. The dates in column B are in Greenwich Mean Time (GMT). GMT is commonly used as a base time zone since the location where each tweet was posted (and its corresponding time zone) is not known. Since GMT is 5 hours later than the Eastern U.S. time zone, where the majority of Trump's tweets were posted, it makes sense to create a new date column, adjusted to the Eastern time zone. This can be accomplished by entering the following formula in cell F2 and copying that formula through the entire F column: =b2-(5/24)
  2. Next, we need to create a column that contains the day of the week for each tweet. This can be accomplished by entering the following formula in G2 and copying that formula through the entire G column: =text(f2, "ddd")

QUESTION 8:    Once the G column contains the day of the week for each tweet, you can use the COUNTIF function to count the number of tweets for each day:

EXTRA CREDIT:    Determine what hour of the day that Trump tweets the most. In order to do this, you will need to adjust for the Eastern time zone (as above) and extract the hour from that date using the HOUR function. Then, you can count the number of tweets for each hour of the day (0-23).