[+] Finalize project

Hykilpikonna
2021-11-27 23:13:56 -05:00
parent 770d4345c4
commit c9338f08df
4 changed files with 21 additions and 14 deletions
+7 -6
@@ -1,7 +1,9 @@
from tabulate import tabulate
from process.twitter_process import *
from process.twitter_visualization import *
from raw_collect.twitter import *
from report.report import serve_report
from utils import *
@@ -70,13 +72,12 @@ if __name__ == '__main__':
####################
# Data Visualization - Step V1
# Meta-data visualization: Let's see some data about the sample
# Generate all visualization reports and graphs
report_all()
# Who posted the most covid tweets? (covid vs non-covid ratio)
# - Graph histogram of this ratio
# Who has the most covid tweet popularity (popularity of covid vs non-covid tweets ratio)
# - Graph histogram of this ratio
####################
# Serve webpage
serve_report()
####################
# Finalize the program for submission.
+2 -2
@@ -120,8 +120,8 @@ def serve_report() -> None:
return send_from_directory(os.path.join(src_dir, 'resources'), path)
# Run app
webbrowser.open("http://localhost:5000")
app.run()
webbrowser.open("http://localhost:8080")
app.run(port=8080)
if __name__ == '__main__':
+5 -4
@@ -23,7 +23,9 @@ We also counted the number of people speaking each language:
1. To create our samples, we collected a wide range of Twitter users using Twitter's get friends list API endpoint [(documentation)](https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-friends-list) and the follows-chaining technique. We specified a single user as the starting point and obtained that user's friends list; we then picked 3 random users and the 3 most followed users from the list, added them to the queue, and repeated the process from each of them. Because of Twitter's rate limiting on the get friends list endpoint, we could obtain at most 200 users per minute, many of them duplicates. We ran the program for one day and obtained 224,619 users (852.3 MB decompressed). However, only the username, popularity, post count, and language data are kept after processing (filtering). The processed user dataset `data/twitter/user/processed/users.json` is 7.9 MB in total. We selected our samples by first filtering the results based on language and selecting the top 500 most followed users as `500-pop`; we then filtered the list again based on post count (>1000) and followers (>150) and selected a random sample of 500 users as `500-rand`.
2. Tweets (ignoring retweets) **TODO**
2. We also downloaded all tweets from our sampled users through the user-timeline API [(documentation)](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline). Due to rate limiting, the program took around 16 hours to finish, and we obtained 7.7 GB of raw data (uncompressed). During processing, for each tweet, we extracted only its date, popularity (likes + retweets), whether it is a retweet, and whether it is COVID-related. The text of the tweets is not retained, and the processed data directory `data/twitter/user-tweets/processed/` is 141.6 MB in total.
3. We also used the COVID-19 daily cases data published by the New York Times [[3]](#ref3) to compare with the peaks and troughs in our frequency-over-date graph.
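
The follows-chaining step described in item 1 can be sketched as a breadth-first crawl. This is only a minimal illustration, not the project's actual code: `pick_next_users`, `follow_chain`, the `get_friends` callback, and the `screen_name` / `followers_count` keys are hypothetical names chosen for this example, and the sketch assumes the 3 random users are drawn from friends outside the top 3. Rate-limit waiting is left out and replaced by a simple request budget.

```python
import random
from collections import deque
from typing import Callable, Dict, List, Optional


def pick_next_users(friends: List[dict], n_random: int = 3, n_top: int = 3,
                    rng: Optional[random.Random] = None) -> List[dict]:
    """Return the n_top most followed friends plus n_random random others."""
    rng = rng or random.Random()
    by_followers = sorted(friends, key=lambda u: u["followers_count"], reverse=True)
    top = by_followers[:n_top]
    rest = by_followers[n_top:]
    # Assumption: random picks come from the remaining friends, so no duplicates.
    return top + rng.sample(rest, min(n_random, len(rest)))


def follow_chain(start: str, get_friends: Callable[[str], List[dict]],
                 max_requests: int) -> Dict[str, dict]:
    """Crawl outward from `start`; each request expands one queued user."""
    seen: Dict[str, dict] = {}   # every user observed so far, keyed by name
    expanded = {start}           # users already queued for expansion
    queue = deque([start])
    requests_made = 0
    while queue and requests_made < max_requests:
        name = queue.popleft()
        friends = get_friends(name)  # hypothetical wrapper around the API call
        requests_made += 1
        for user in friends:
            seen.setdefault(user["screen_name"], user)  # drop duplicates
        for user in pick_next_users(friends):
            if user["screen_name"] not in expanded:
                expanded.add(user["screen_name"])
                queue.append(user["screen_name"])
    return seen
```

In a real run, `get_friends` would wrap the friends list endpoint and wait between calls to respect the rate limit; passing a stub function instead makes the selection logic easy to test offline.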
## Computation & Filtering
@@ -181,10 +183,9 @@ Despite efforts to filter out noise or normalize the graph discussed in the [met
# Conclusion
_**TODO**_: A conclusion
In summary, the key findings of our research are that while news channels post about COVID-19 more frequently (Median = 7.6%), average Twitter users and the most popular users don't post about it very much (Median ≤ 1.5%). While the COVID-posting frequencies for `eng-news` and `500-pop` fluctuate with the number of new cases in the U.S., average Twitter users' COVID-posting frequency dropped and has continued to decrease since June 2020. These posts were also not as popular (not liked or commented on as much) as users' other posts (Median ≤ 0.87).
* Why are these findings important? What do they reveal?
* Connect to larger theme?
These findings might not be surprising, but they might have again demonstrated people's ability to adapt to new environments. The attentional effect of the start of COVID-19 might be similar to the satisfaction from buying a new thing or the grief from losing something: they all fade over time as we adapt to them. Even though people focused a lot of attention on COVID-19 when new information first became available in March 2020, people's interest in these topics decreased within three months as we adapted to the new normal with COVID-19, as demonstrated by the quickly decreasing posting rates. As for the audience, rather than liking or commenting on COVID-19 posts, they might have quickly scrolled past them in favor of more interesting posts. It is fascinating that we can learn to adapt to such a devastating change in our environment in only three months.
## TODO
+7 -2
@@ -24,7 +24,7 @@ sorting=nyt
\section{Problem Description and Research Question}
\indent
We have observed that there have been increasingly more voices talking about COVID-19 since the start of the pandemic. However, with recent policy changes in many countries aiming to limit the effect of COVID-19, it is unclear how people's discussions would react. Some people might be inclined to believe that the pandemic is starting to end so that discussing it would seem increasingly like an unnecessary effort. In contrast, others might find these policy changes controversial and want to voice their opinions on them even more. Also, even though COVID-related topics are almost always on the news, some news outlets might intentionally cover them more frequently than others. For the people watching the news, some people might find these news reports interesting, while others can't help but switch channels. So, how people's interest in listening to or discussing COVID-related topics changes over time is not very clear. \textbf{Our goal is to analyze how people's interest in COVID-related topics changes and how frequently people have discussed COVID-related issues in the two years since the pandemic started.} Also, different social media platforms might induce people to view the pandemic differently. For example, we don't know whether people on open social media platforms such as Twitter, where everyone can view your posts, might be more or less inclined to post COVID-related content than people on closed social media platforms such as Instagram, Wechat, or Telegram. Also, people or news outlets with different numbers of followers or viewers might have different inclinations too. \textbf{So, we also aim to compare people's interests in posting about COVID-related topics between platforms and popularity.}
We have observed that there have been increasingly more voices talking about COVID-19 since the start of the pandemic. However, different groups of people might view the importance of discussing the pandemic differently. For example, we don't know whether the most popular people on Twitter will be more or less inclined to post COVID-related content than the average Twitter user. Also, while some audience members find this content interesting, others quickly scroll past it. \textbf{So, we aim to compare people's interests in posting coronavirus content and the audience's interests in viewing it between different groups.} Also, with recent developments and policy changes toward COVID-19, it is unclear how people's discussions would react. Some people might believe that the pandemic is starting to end so that discussing it would seem increasingly like an unnecessary effort, while others might find these policy changes controversial and want to voice their opinions even more. Also, even though COVID-related topics are almost always on the news, some news outlets might intentionally cover them more frequently than others. For the people watching the news, some people might find these news reports interesting, while others can't help but switch channels. So, how people's interest in listening to or discussing COVID-related topics changes over time is not very clear. \textbf{Our second goal is to analyze how people's interest in COVID-related topics changes and how frequently people have discussed COVID-related issues in the two years since the pandemic started.}
\section{Dataset Used}
\indent
@@ -33,6 +33,9 @@ sorting=nyt
\item[1.] A wide range of Twitter users: We used Twitter's get friends list API \href{https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-friends-list}{(documentation)} and the follows-chaining technique to obtain a wide range of Twitter users. This technique is explained in the Computational Overview section. Due to rate limiting, we ran the program for one day and obtained 224,619 users (852.3 MB decompressed). However, only the username, popularity, post count, and language data are used, and the processed (filtered) user dataset \C{data/twitter/user/processed/users.json} is only 7.9 MB in total.
\item[2.] All tweets from sampled users: We selected two samples of 500 users each (the sampling method is explained in the Computational Overview section), and we used the user-timeline API \href{https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline}{(documentation)} to obtain all of their tweets. Due to rate limiting, the program took around 16 hours to finish, and we obtained 6.07 GB of raw data (uncompressed). During processing, we reduced the data for each tweet to only its date, popularity (likes + retweets), whether it is a retweet, and whether it is COVID-related. The text of the tweets is not retained, and the processed data directory \C{data/twitter/user-tweets/processed} is only 107.9 MB in total.
\item[3.] Top 100 news Twitter accounts by Bremmen (https://memeburn.com/2010/09/the-100-most-influential-news-media-twitter-accounts/)
\item[4.] COVID-19 daily new cases data by New York Times (https://github.com/nytimes/covid-19-data).
\end{itemize}
\section{Computational Overview}
@@ -40,6 +43,8 @@ sorting=nyt
\subsection*{Data Gathering}
\indent
However, since Twitter limits the request rate of this API endpoint to 1 request ($\le 200$ users) per minute, we ran the program continuously for one day to gather this data.
We plan to transform different platforms' user posting data, all with unique formats, into data in a platform-independent data model to store and compare. When processing social media data, we will convert platform-dependent keywords such as \texttt{favorites}, \texttt{retweets}, or \texttt{full\_text} on Twitter and \texttt{content}, \texttt{views}, or \texttt{comments} on Telegram into our unique platform-independent model with keywords such as \texttt{popularity} and \texttt{text}. We will store all processed data in \textbf{JSON} before analysis. As for the raw data from different social media platforms, we plan to gather Twitter data using the \textbf{Tweepy} library and Telegram channel data using \textbf{python-telegram-bot}. Unfortunately, there are no known libraries for Wechat Moments. We will try to obtain Wechat data through packet capture using pyshark, but that might not be successful.
For news outlet data, we plan to use \textbf{requests} to obtain raw HTML from different listing sites, extract news article titles, publishers, and publishing dates with \textbf{regex}, and store them using JSON. We will convert different HTML formats from different news publishers' sites into our platform-independent news model.
@@ -67,7 +72,7 @@ sorting=nyt
First, we originally planned to include news reports from separate journal websites in our analysis as well. However, when we gathered the data, we found that there was no way to identify the popularity of a news report published on a journal website. So, we decided to gather the tweets of news accounts on Twitter instead, which also has the benefit of using the same data gathering and analysis process for each news channel.
Second, we originally planned to compare people's interests in posting COVID-related topics between different platforms because we thought Chinese people don't rely on Twitter as much since Twitter is blocked in China. However, there isn't any publicly available Wechat API that we can use for analysis, and Wechat is also more private, with access to someone's postings limited to only their friends. (It is as if everyone on Twitter has a locked account.) Therefore, it is impractical to gather data from Wechat. As for Telegram channels, the posts do not have a like feature, and might not have a commenting feature unless the channel host specifically sets it up using a third-party bot. So there isn't a reliable way to obtain popularity data on Telegram either. So, we were limited to analyzing Twitter as the only platform.
Second, we originally planned to compare people's interests in posting COVID-related topics between different platforms because we thought Chinese people don't rely on Twitter as much since Twitter is blocked in China. However, there isn't any publicly available Wechat API that we can use for analysis, and Wechat is also more private, with access to someone's postings limited to only their friends. (It is as if everyone on Twitter has a locked account.) Therefore, it is impractical to gather data from Wechat. As for Telegram channels, the posts do not have a like feature, and might not have a commenting feature unless the channel host specifically sets it up using a third-party bot. So there isn't a reliable way to obtain popularity data on Telegram either. So, instead of comparing between platforms, we compared different groups of people on the Twitter platform.
\section{Discussion}
\indent