CSC110-Project/index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>CSC110 Report</title>
    <link rel="stylesheet" href="html/style.css">
</head>
<body>
<div id="content">

</div>
<script src="html/marked.min.js"></script>
<script src="html/jquery.min.js"></script>
<script src="html/polyfill.es6.min.js"></script>
<script src="html/mathjax-tex-mml-chtml.js"></script>

<script>

// Python will inject the markdown code here.
markdown = {"content": "\n# Shifting Interest in COVID-19 Twitter Posts\n\n## Introduction\n\nWe have observed that there have been increasingly more voices talking about COVID-19 since the start of the pandemic. However, different groups of people might view the importance of discussing the pandemic differently. For example, we don't know whether the most popular people on Twitter will be more or less inclined to post COVID-related content than the average Twitter user. Also, while some audience finds these content interesting, others quickly scroll through them. **So, we aim to compare people's interests in posting coronavirus content and the audience's interests in viewing them between different groups.** Also, with recent developments and policy changes toward COVID-19, it is unclear how people's discussions would react. Some people might believe that the pandemic is starting to end so that discussing it would seem increasingly like an unnecessary effort, while others might find these policy changes controversial and want to voice their opinions even more. Also, even though COVID-related topics are almost always on the news, some news outlets might intentionally cover them more frequently than others. For the people watching the news, some people might find these news reports interesting, while others can't help but switch channels. So, how people's interest in listening or discussing COVID-related topics changes over time is not very clear. **Our second goal is to analyze how people's interest in COVID-related topics changes and how frequently people have discussed COVID-related issues in the two years since the pandemic started.**\n\n# Method\n\n## Demographics\n\nOur data come from three samples:\n\n* `500-pop`: The list of 500 most followed users on Twitter who post in English, Chinese, or Japanese.\n* `500-rand`: A sample of 500 random users on Twitter who post in English, Chinese, or Japanese with at least 1000 posts and at least 150 followers.\n* `eng-news`: A list of 100 top news Twitter accounts by Nur Bremmen [[1]](#ref1), combined with all news accounts which TwitterNews reposted. All of them post in English, and most of them target audience in North America.\n\nWe also counted the number of people speaking each language:\n\n|            |   Total |   English |   Chinese |   Japanese |\n|------------|---------|-----------|-----------|------------|\n| `500-pop`  |     500 |       495 |         0 |          5 |\n| `500-rand` |     500 |       393 |        15 |         92 |\n\n\n## Data Collection\n\n1. To create our samples, we collected a wide range of Twitter users using Twitter's get friends list API endpoint [(documentation)](https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-friends-list) and the follows-chaining technique. We specified one single user as the starting point, obtained the user's friends list, then we picked 3 random users and 3 most followed users from the friend list, add them to the queue, and start the process again from each of them. Because of Twitter's rate limiting on the get friends list endpoint, we can only obtain a maximum of 200 users per minute, with many of them being duplicates. We ran the program for one day and obtained 224,619 users (852.3 MB decompressed). However, only the username, popularity, post count, and language data are kept after processing (filtering). The processed user dataset `data/twitter/user/processed/users.json` is 7.9 MB in total. We selected our samples by filtering the results first based on language, selected the top 500 most followed users as `500-pop`, filtered the list again based on post count (>1000) and followers (>150), then selected a random sample of 500 users as `500-rand`.\n\n2. We also downloaded all tweets from our sampled users through the user-timeline API [(documentation)](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline). Due to rate limiting, the program took around 16 hours to finish, and we obtained 7.7 GB of raw data (uncompressed). During processing, for each tweet, we extracted only its date, popularity (likes + retweets), whether it is a retweet, and whether it is COVID-related. The text of the tweets are not retained, and the processed data directory `data/twitter/user-tweets/processed/` is 141.6 MB in total.\n\n3. We also used the COVID-19 daily cases data published by New York Times [[3]](#ref3) to compare with peaks and through in our frequency over date graph.\n\n## Computation & Filtering\n\nTo analyze the frequencies and relative popularity of COVID-related posting either for all posts from a specific user, or for a sample across many users for a specific date, we defined several formulas. First, we need to define many terms we will use in the following sections:\n\n* **Frequency**: The percentage of COVID-related posts compared to all posts, showing how frequent COVID-related content are posted.\n* **Popularity**: The integer value representing the popularity of a post, measured by the total number of user interactions on a post, which is the number of likes and comments on a tweet combined.\n* **Popularity Ratio**: The relative popularity between 0 and infinity calculating how popular are a user's COVID-posts compared to all the user's posts, which is a ratio of the average popularity of COVID-posts over all posts. If COVID-posts are more popular, then this value should be greater than 1, and if they are less popular, this value should be less than 1. Since follower count and interaction rate differs wildly between users, we cannot assume that popularity is comparable between users, so popularity is only compared within a user, while popularity ratio can be compared across users.\n\n### 1. Computation - User Analysis\n\nIn the first section, we used the following formulas to calculate statistical distributions of the frequencies and popularity ratios of users in a sample:\n\n<blockquote>\n$$ \\text{freq}_{u} = \\frac{|\\text{COVID-posts by } u|}{|\\text{All posts by } u|} $$\n</blockquote>\n\n<blockquote>\n$$ \\text{pop_ratio}_{u} = \\left(\\frac{\\sum\\text{Popularity of COVID-posts by } u}{|\\text{COVID-posts by } u|}\\right) / \\left(\\frac{\\sum \\text{Popularity of all posts by } u}{|\\text{All posts by } u|}\\right) $$\n</blockquote>\n\nThe frequency equation can divide by zero if the user has zero posts, and it is logical to assign the frequency to 0 when the user didn't post anything. However, it is not sensible to assign the popularity ratio to zero when the pop_ratio equation divides by zero. There are three divisions in the pop_ratio equation, so there are three possible places where it might divide by zero. To prevent division by zero, people who didn't post about COVID-19, who didn't post anything at all, and who have literally 0 popularity on any of their posts are ignored. In our data, this amount of people are ignored for each sample:\n\n|         |   `500-pop` |   `500-rand` |   `eng-news` |\n|---------|-------------|--------------|--------------|\n| Ignored |         117 |          205 |           28 |\n\n\nThen, the users' results are graphed in one histogram for each sample to gain some insights about the distribution of user frequencies. However, there are many outliers and more than half who posted below 0.1% for two of our samples, making the graphs unreadable: (You can click on the images to enlarge them, and hold down E to view full screen)\n\n<div class=\"image-row\">\n    <div><img src=\"/freq/500-pop-hist-outliers.png\" alt=\"hist\"></div>\n    <div><img src=\"/freq/500-rand-hist-outliers.png\" alt=\"hist\"></div>\n    <div><img src=\"/freq/eng-news-hist-outliers.png\" alt=\"hist\"></div>\n</div>\n\nFor example, even though most of `500-rand` are concentrated below 10%, the x-axis scale is stretched to 50% by many outliers who post more than 40%:\n\n| Username        | Frequency   |\n|-----------------|-------------|\n| JHUCAIH         | 54.8%       |\n| DrJudyMonroe    | 49.6%       |\n| PoltergeistTC   | 41.6%       |\n| _FatmaAhmed     | 31.6%       |\n| OUCitizenGovern | 28.6%       |\n| btolchin        | 27.0%       |\n\nTo resolve this, the outliers are removed both for frequencies and popularity ratios using the method proposed by Boris Iglewicz and David Hoaglin (1993) [[2]](#ref2), and for frequencies, everyone who posted below 0.1% are ignored when graphing histograms. They are not ignored in statistic calculations.\n\n### 2. Computation - Change Analysis\n\nThe second section analyzes data separate for each of our samples, just like the first section. However, unlike how calculations are separated for each user in the first section, the second section separates calculation by date and combines users in a sample. We defined the start of COVID-19 as _2020-01-01_ and ignored all posts prior to this date. Then, the average frequency and popularity ratio are calculated for every day since _2020-01-01_. This calculation gave us a list `freqs` and a list `pops` where, for every date `dates[i]`,\n\n<blockquote>\n$$ \\text{freq}_i = \\frac{|\\text{COVID-posts on date}_{i}|}{|\\text{All posts on date}_{i}|} $$\n</blockquote>\n\n<blockquote>\n$$ \\text{pop_ratio}_i = \\frac{ \\sum_{u \\in \\text{Users}} \\left(\\frac{\\sum\\text{Popularity of u's COVID-posts on date}_i}{|\\text{u's COVID-posts on date}_i| \\cdot (\\text{Average popularity of all u's posts})}\\right)}{(\\text{Number of users posted on date}_i)} $$\n</blockquote>\n\nAfter calculation, `freqs` and `pops` are plotted in line graphs against `dates`. Initially, we are seeing graphs with very high peaks such as the graph below. After some investigation, we found that these peaks are caused by not having enough tweets on each day to average out the random error of one single popular tweet. For example, in the graph below, we adjusted the program to print different users' popularity ratios when we found an average popularity ratio of greater than 20, which produced the output on the right. As it turns out, on 2020-07-11, the user @juniorbachchan posted that he and his father tested positive, and that single post is 163.84 times more popular than the average of all his posts. (The post is linked [here](https://twitter.com/juniorbachchan/status/1282018653215395840), it has 235k likes, 25k comments, and 32k retweets). Even though these data points are outliers, there isn't an effective way of removing them since we don't have enough tweets data from each user to calculate their range (for example, someone's COVID-related post might be the only one they've posted). So, we've decided to limit the viewing window to `y = [0, 2]` as shown in the graph on the right.\n\n<div class=\"image-row\">\n    <div><img src=\"html/peak-1.png\" alt=\"graph\"></div>\n    <div style=\"display: flex; flex-direction: column; justify-content: center\"><pre>\nDate:  2020-07-11\n- JoeBiden 1.36\n<span class=\"highlight\">- juniorbachchan 163.84</span>\n- victoriabeckham 0.80\n- anandmahindra 7.66\n- gucci 0.13\n- StephenKing 0.61\n    </pre></div>\n    <div><img src=\"html/peak-2.png\" alt=\"graph\"></div>\n</div>\n\nThen, we encountered the issue of noise. When we plot the graph without a filter, we found that the graph is actually very noisy. We decided to average the results over 7 days. Then, we also experimented with different filters from the `scipy` library and different parameter values, and chose to use an IIR filter with `n = 10`.\n\n<div class=\"image-row\">\n    <div><img src=\"/change/n/5.png\" alt=\"graph\"></div>\n    <div><img src=\"/change/n/10.png\" alt=\"graph\"></div>\n    <div><img src=\"/change/n/15.png\" alt=\"graph\"></div>\n</div>\n\n# Results\n\n## User Analysis\n\nThis section ignores date and focuses on user differences within our samples, which will answer the first part of our research question: **how frequently does people post about COVID-related issues, and how interested are people to see COVID-related posts?**\n\n### 1. User Posting Frequency\n\nFirst, the users' COVID-related posting frequency in these three datasets are analyzed. Initially, we were expecting that most people will post coronavirus content because this pandemic is very relevant to everyone. However, there are many people in our samples didn't post about COVID-19 at all. The following table shows how many people in each sample didn't post or posted less than 1% about COVID-19:\n\n|                               |   `500-pop` |   `500-rand` |   `eng-news` |\n|-------------------------------|-------------|--------------|--------------|\n| Total users                   |         500 |          500 |          310 |\n| Users who didn't post at all  |         117 |          205 |           26 |\n| Users who posted less than 1% |         288 |          313 |           57 |\n\n\nThe `eng-news` sample has the lowest number of users who didn't have COVID-related posts, the `500-rand` sample has the highest, while `500-pop` sits in between. This large difference between `eng-news` and the rest can be explained by the news channels' obligation to report news, which includes news about new outbreaks, progress of vaccination, new cross-border policies, etc. Also, `500-pop` has much more users who posted COVID-related content than `500-rand`, while they have similar amounts of users posting less than 1%. This finding might be explained by how influential people have more incentive to express their support toward slowing the spread of the pandemic than regular users, which doesn't require frequent posting like news channels.\n\nThen, the calculated frequency data for each user in a sample are graphed in histograms:\n\n<div class=\"image-row\">\n    <div><img src=\"/freq/500-pop-hist.png\" alt=\"hist\"></div>\n    <div><img src=\"/freq/500-rand-hist.png\" alt=\"hist\"></div>\n    <div><img src=\"/freq/eng-news-hist.png\" alt=\"hist\"></div>\n</div>\n\nAs expected, the distributions looks right-skewed, with most people posting not very much. One interesting distinction is that, even though the distributions follow similar shapes, the x-axis ticks of `eng-news` is actually ten times larger than the other two, which means that `eng-news` post a lot more about COVID-19 on average than the other two samples. Statistics of the samples are calculated to further verify these insights:\n\n|          | `500-pop`   | `500-rand`   | `eng-news`   |\n|----------|-------------|--------------|--------------|\n| Median   | 1.3%        | 1.5%         | 7.6%         |\n| IQR      | 3.5%        | 4.3%         | 7.8%         |\n| Q1 (25%) | 0.4%        | 0.5%         | 3.1%         |\n| Q3 (75%) | 3.9%        | 4.8%         | 10.9%        |\n\nSince there are many outliers, medians and IQR will more accurately represent the center and spread of this distribution. As these numbers show, `eng-news` do post much more (a 6.1% increment in post frequency, or a 406.7% increase) than the other two samples. Again, this can be explained by the news channels' obligation to report news related to COVID-19 or to promote methods to slow the spread of the pandemic. These means also shows that 50% of average Twitter users dedicate below 1.5% of their timeline to COVID-related posts.\n\n### 2. User Popularity Ratios\n\nSimilar histograms are graphed and statistics are calculated for user's popularity ratios in their sample, calculated using the formula described in the methods section:\n\n<div class=\"image-row\">\n    <div><img src=\"/pop/500-pop-hist.png\" alt=\"hist\"></div>\n    <div><img src=\"/pop/500-rand-hist.png\" alt=\"hist\"></div>\n    <div><img src=\"/pop/eng-news-hist.png\" alt=\"hist\"></div>\n</div>\n\nLooking at the histograms, while `eng-news` is roughly symmetric, the other two distributions are right skewed. \n\n|          |   `500-pop` |   `500-rand` |   `eng-news` |\n|----------|-------------|--------------|--------------|\n| Median   |        0.69 |         0.87 |         0.87 |\n| IQR      |        0.65 |         0.96 |         0.57 |\n| Q1 (25%) |        0.38 |         0.34 |         0.61 |\n| Q3 (75%) |        1.03 |         1.3  |         1.18 |\n\nThe calculated medians show that the audience normally don't like or comment on COVID-related posts as much as other posts by all three groups, which implies that people aren't as interested in these posts. The average Twitter user's and the average English news channel's COVID-posts has only 87% of the popularity compared to their other posts, while the average `500-pop` user has only 69% of the popularity. This difference is possibly because the most popular users' audience probably followed them for the specific types of content that only they can post, and not general COVID-related content that anyone can post similarly. \n\nAlso, even though the medians for `500-rand` and `eng-news` are the same, since the `500-rand` distribution is right skewed, its 25th percentile is much lower\u201425% of average Twitter users' COVID-posts are only 34% as popular as their other posts. \n\n## Change Analysis\n\nAfter we answered how frequently people posted about COVID-19 and how interested are people to view these posts, we analyze our data over the posting dates to answer the second part of our research question: **How does posting frequency and people's interests in COVID-19 posts changes from the beginning of the pandemic to now?**\n\n### 1. Posting Frequency Over Time\n\nWe graphed the posting frequencies of our three samples in line graphs with the x-axis being the date with labels representing the month, which gave us the following graphs:\n\n<div class=\"image-row\">\n    <div><img src=\"/change/freq/500-pop.png\" alt=\"graph\"></div>\n    <div><img src=\"/change/freq/500-rand.png\" alt=\"graph\"></div>\n    <div><img src=\"/change/freq/eng-news.png\" alt=\"graph\"></div>\n</div>\n\nLooking at three graphs individually, the posting rates were almost zero during the first two month when COVID-19 first started for all three samples, which is expected because no one knew how devastating it will be at that time. Then, all three samples had a peak in posting frequencies from March to June 2020. After June 2020, the posting rate for both `500-rand` and `eng-news` declined to around 1/3 of the peak, with `500-pop` declining slightly as well. While the reason to this decline is unclear, we speculate that it might be caused by people's loss of interest in the topic as they realize COVID-19 isn't going to be a disaster that fades away quickly, or as the news became less \"breaking\" and information started to repeat. Like the selective attention theory of cognitive psychology, people's attention to one thing comes at the expense of others since our attention is very limited. So people might have chosen to direct more attention to living rather than paying attention to the coronavirus that didn't seem to go away soon. Also, similar to how people will unconsciously learn to ignore repeated background noise after moving to a new environment (in a process called habituation), they might have learned to ignore the repeated information about COVID-19, which will lead to less COVID-related posting. Further research can determine whether this three-month attention span can generalize to other long-term disaster other than COVID-19.\n\nAfter June 2020, `500-rand` continued declining steadily without major peaks, while `eng-news` had a smaller peak around Dec 2020 and a trough after June 2021, and `500-top` had many peaks and toughs after. In an effort to interpret these peaks, we overlapped the three charts with the data of new COVID-19 cases in the U.S. published by New York Times [[3]](#ref3), which gave us the following graph: \n\n<div class=\"image-row\">\n    <div><img src=\"/change/comb/freq.png\" alt=\"graph\" class=\"large\"></div>\n</div>\n\nIn this graph, we can see that the peak around Dec 2020 and the trough around Jun 2021 in `eng-news` and `500-pop` actually correspond very closely with the rise and fall of new cases in the U.S., which is reasonable because there are more sensational news to report and more COVID-related events happening to popular individuals when cases are high. However, even though the first peak in cases around August 2020 did correlate with a peak in `500-rand`, the rise and fall of cases in the U.S. doesn't seem to affect `500-rand` overall. This is possibly because we included three languages in the population of our random sample, which means that `500-rand` isn't limited to English-speaking accounts that mostly target the U.S. audience like `eng-news`.\n\n### 2. Popularity Ratio Over Time\n\nWe graphed a similar graph with popularity ratio being the y-axis over date as the x-axis, as shown below:\n\n<div class=\"image-row\">\n    <div><img src=\"/change/comb/pop.png\" alt=\"graph\" class=\"large\"></div>\n</div>\n\nDespite efforts to filter out noise or normalize the graph discussed in the [method](#method) section, we did not find any patterns in the resulting graph. The peaks and troughs of each line seems random, and the three lines did not have common peaks or troughs that might reveal meaningful insights. The raw data looks very much like random noise as well. This lack of meaningful information is possibly because our sample size is comparatively small\u2014even though we have 500 users in our `500-pop` sample, the sample size for tweets by these users on one specific day is very small. For example, there are only 6 users in `500-pop` who posted on `2020-07-11`. This lack of samples amplified the effect of randomness, and more data may be needed to reduce the effect of one tweet on the popularity ratio for the specified date. Unfortunately, we have to reach a conclusion that more data is needed to reveal interesting findings.\n\n# Conclusion\n\nIn summary, key findings in our research include that while news channels post about COVID-19 more frequently (Median = 7.6%), average Twitter users and most popular users don't post very much (Median \u2264 1.5%). And while COVID-posting frequencies for `eng-news` and `500-pop` fluctuates with the number of new cases in the U.S., average Twitter users' COVID-posting frequency dropped and continued to decrease since Jun 2020. And these posts were not as popular (not liked or commented as much) as users' other posts (Median \u2264 0.87).\n\nThese findings might not be surprising, but they might have again demonstrated people's ability to adapt to new environments. The duration of the sensational effect of the start of COVID-19 might be similar to the grief from losing something important, they all fade over time as we adapt to them. Even though people focused a lot of attention on COVID-19 when new information first became available from March 2020, people's interest in these topics decreased as we adapt to the new norm with COVID-19 in three months, demonstrated by the quickly decreasing posting rates. Or, for the audience, rather than liking or commenting on COVID-19 posts, they might have quickly scrolled through them in favor of more interesting posts. It is fascinating that we can learn to adapt to such a devastating change in our environment in only three months.\n\n# References\n\n<a id=\"ref1\"></a>\n\n[1] Bremmen, N. (2010, September 3). The 100 most influential news media Twitter accounts. _Memeburn_. Retrieved November 27, 2021, from https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/.\n\n<a id=\"ref2\"></a>\n\n[2] Iglewicz, Boris, & David Hoaglin (1993), \"Volume 16: How to Detect and\nHandle Outliers\", _The ASQC Basic References in Quality Control:\nStatistical Techniques_, Edward F. Mykytka, Ph.D., Editor.\n\n<a id=\"ref3\"></a>\n\n[3] The New York Times. (2021). Coronavirus (Covid-19) Data in the United States. Retrieved November 27, 2021, from https://github.com/nytimes/covid-19-data.\n\n<a id=\"ref4\"></a>\n\n[4] WHO. (n.d.) _Listings of WHO's Response to COVID-19._ World Health Organization. Retrieved November 27, 2021, from https://www.who.int/news/item/29-06-2020-covidtimeline.\n"}

document.getElementById('content').innerHTML =
    marked.parse(markdown.content);

// Make images clickable
// Improved from: https://stackoverflow.com/a/50430187/7346633
body = $('body')
$('img').addClass('clickable').click(function() {
    const src = $(this).attr('src');
    let modal;

    function removeModal() {
        modal.remove();
        body.off('keyup.modal-close');
    }

    modal = $('<div id="modal">').css({
        background: 'RGBA(0,0,0,.5) url(' + src + ') no-repeat center',
        backgroundSize: $(this).hasClass('large') ? 'contain' : 'auto',
        width: '100vw',
        height: '100vh',
        position: 'fixed',
        zIndex: '100',
        top: '0',
        left: '0',
        cursor: 'zoom-out'
    }).click(function() {
        removeModal();
    }).appendTo('body');

    // Handling keyboard shortcuts
    body.on('keyup.modal-close', (e) => {
        if (e.key === 'Escape') removeModal();
        if (e.key === 'e') modal.removeClass('zoom')
    });
    body.on('keydown.modal-close', (e) => {
        if (e.key === 'e') modal.addClass('zoom')
    })
});

</script>
</body>
</html>