Introduction

Politics and governmental rule have always been a source of discussion and debate in societies throughout history. Today, debate and conversation play a key part in democracy. In the United States, voters are encouraged to engage in conversations with both like-minded peers and those who disagree with them in order to understand others’ opinions, make informed decisions, and allow all voices to be heard. In the age of the internet, there are more opportunities than ever to engage in discussions about politics. Reddit is a popular social media platform full of communities engaging in regular political discussions. Users participate in topical forums called “subreddits” to voice their opinions, share information, and engage with others. Online political discussions reflect real-world events and their repercussions in the thoughts and opinions of everyday Americans.

Project Overview

In this project, we aim to use Reddit data from June 2023 to July 2024 to understand trends and changes in opinion in U.S. political discussions. We will analyze data from political subreddits to understand how different subreddits engage in political discussion and how those discussions compare with real-world political sentiment, such as presidential job approval. We will identify which political subreddits tend to have more controversial discussions, how often posts lean towards extremism, and how distinctive conversations are among different political leanings. We will also analyze whether subreddits tend toward positive or negative conversations. As part of our analysis, we will examine how these trends change over the study period and how they align with, or are affected by, real-world political events.

Our analysis will take a variety of forms to make sense of the vast amount of Reddit data. We will begin with Exploratory Data Analysis (EDA) to understand the data and uncover patterns. Next, we will move to Natural Language Processing (NLP), where we will employ specific techniques to analyze textual data and gather more advanced insights. After that, we will use Machine Learning (ML) to build predictive models based on Reddit posts to understand patterns in subreddits and gain new insights. Finally, we will summarize our findings to highlight the key takeaways from our analysis of political discussions.

For our political dataset, we will be using the six subreddits outlined in Table 1.

Table 1: Overview of subreddits used for analysis throughout the project

Subreddits Used in Analysis
Overview of political subreddits used as the analytical dataset for the project
Subreddit	Description	Number of Posts^*	Number of Comments^*	Number of Members^†
r/Conservative	Largest conservative subreddit, dedicated to discussing conservative issues and values	113,148	2,160,974	1.1M
r/democrats	Discussion about getting Democrats elected up and down the ballot	23,289	322,764	481K
r/Liberal	News about Liberals and Democrats	4,527	73,665	120K
r/Libertarian	Discussion of libertarianism and its ideologies	14,309	364,029	504K
r/politics	News and discussion about U.S. politics	140,670	11,541,659	8.7M
r/Republican	Partisan place for Republicans to discuss issues with other Republicans	183,000	116,925	205K
^* Data from June 2023 to July 2024
^† Data from November 14, 2024

Using these subreddits, we will conduct a thorough analysis of political discussions on Reddit. To narrow the data to what is most relevant to our research, we filtered the data from the listed subreddits to include only posts that mention “Trump” or “Biden” in the title or body of the post. This allows us to focus on national U.S. political discussions related to the two most recent U.S. Presidents. For details on our specific business goals, see the following section.
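As a rough illustration of this keyword filter, here is a minimal pure-Python sketch. The field names `title` and `selftext`, the sample records, and the choice of a case-insensitive whole-word match are all assumptions made for the example, not details confirmed by the project.

```python
import re

# Hypothetical post records; the field names "title" and "selftext" are assumptions.
posts = [
    {"subreddit": "politics", "title": "Biden signs new bill", "selftext": ""},
    {"subreddit": "Libertarian", "title": "Tax policy thread", "selftext": "Thoughts on zoning?"},
    {"subreddit": "Conservative", "title": "Rally recap", "selftext": "Trump spoke for an hour."},
]

# Whole-word, case-insensitive match (an assumption), so "trumpet" is not counted.
KEYWORDS = re.compile(r"\b(trump|biden)\b", re.IGNORECASE)


def mentions_president(post):
    """Return True if the title or body of the post mentions Trump or Biden."""
    text = f"{post.get('title') or ''} {post.get('selftext') or ''}"
    return bool(KEYWORDS.search(text))


# Keep only the posts that mention either keyword.
filtered = [p for p in posts if mentions_president(p)]
print([p["subreddit"] for p in filtered])
```

On the full dataset the same predicate would more plausibly be expressed as a distributed filter rather than a Python list comprehension.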

Detailed Project Business Goals

EDA

Idea 1: Subreddit Political Engagement

Business goal: Identify which subreddits frequently discuss the President and U.S. politics to understand the spread of political discourse across Reddit.

Technical proposal: Use PySpark to filter and aggregate posts mentioning the President, U.S. politics, and elections, grouping by subreddit. Analyze and visualize the distribution of posts across subreddits to see whether discussions are concentrated in subreddits dedicated to politics (such as r/politics) or are prevalent in general-interest subreddits. Normalize the data to account for differences in subreddit sizes.
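The grouping and normalization steps can be sketched in plain Python as follows. The proposal calls for PySpark; this standard-library version only illustrates the logic. The sample posts are made up, the member counts come from Table 1, and normalizing as "posts per 100,000 members" is one possible choice that we assume here.

```python
from collections import Counter

# Toy sample: one subreddit name per keyword-matching post (made-up data).
matching_posts = ["politics", "politics", "politics", "Conservative", "Conservative", "Liberal"]

# Member counts taken from Table 1.
members = {"politics": 8_700_000, "Conservative": 1_100_000, "Liberal": 120_000}

# Raw number of matching posts per subreddit.
counts = Counter(matching_posts)

# Posts per 100,000 members; this particular normalization is an assumption.
per_100k = {sub: counts[sub] / members[sub] * 100_000 for sub in counts}

# Print subreddits from highest to lowest normalized rate.
for sub, rate in sorted(per_100k.items(), key=lambda kv: kv[1], reverse=True):
    print(f"r/{sub}: {counts[sub]} posts, {rate:.3f} per 100k members")
```

In this toy sample the smallest subreddit ranks first once size is taken into account, which is the kind of reordering normalization is meant to reveal.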

Idea 2: Controversiality vs. Comment Count

Business goal: Analyze the relationship between the controversiality of posts and comment count to understand user engagement during politically charged discussions.

Technical proposal: Use the controversiality metric encoded in the data to identify controversial posts. Compare the average comment count and score of controversial posts to non-controversial posts. Compare the distribution of controversial posts across different subreddits to identify which subreddits tend to have more controversial discussions.
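A minimal sketch of this comparison, assuming the controversiality metric is a 0/1 flag and that each record carries `num_comments` and `score` fields (field names and sample values are hypothetical):

```python
from statistics import mean

# Hypothetical records; "controversiality" is assumed to be a 0/1 flag.
records = [
    {"subreddit": "politics", "controversiality": 1, "num_comments": 120, "score": 15},
    {"subreddit": "politics", "controversiality": 0, "num_comments": 40, "score": 300},
    {"subreddit": "Republican", "controversiality": 1, "num_comments": 80, "score": 5},
    {"subreddit": "Republican", "controversiality": 0, "num_comments": 20, "score": 50},
    {"subreddit": "Republican", "controversiality": 0, "num_comments": 30, "score": 70},
]


def group_means(rows, flag):
    """Average comment count and score for rows with the given controversiality flag."""
    subset = [r for r in rows if r["controversiality"] == flag]
    return {
        "avg_comments": mean(r["num_comments"] for r in subset),
        "avg_score": mean(r["score"] for r in subset),
    }


controversial = group_means(records, 1)
non_controversial = group_means(records, 0)

# Share of controversial records within each subreddit.
share_by_subreddit = {}
for sub in sorted({r["subreddit"] for r in records}):
    rows = [r for r in records if r["subreddit"] == sub]
    share_by_subreddit[sub] = sum(r["controversiality"] for r in rows) / len(rows)

print(controversial, non_controversial, share_by_subreddit)
```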

Idea 3: Distribution of Political Contributions by Top Authors

Business goal: Examine how political contributions mentioning Biden, Harris, or Trump are distributed among the top authors in each subreddit. Do a few people account for the majority of the posts?

Technical proposal: We will detect political contributions by searching for the keywords Trump, Biden, and Harris. We will then determine the top posters of this political content in each subreddit and calculate the percentage of all political posts authored by these contributors.
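The top-author calculation can be sketched as below. The `(subreddit, author)` pairs are invented, and the cutoff `top_n` is a hypothetical parameter, since the proposal does not say how many authors count as "top".

```python
from collections import Counter, defaultdict

# Hypothetical (subreddit, author) pairs for posts that matched the keywords.
political_posts = [
    ("politics", "user_a"), ("politics", "user_a"), ("politics", "user_a"),
    ("politics", "user_b"), ("politics", "user_c"),
    ("Liberal", "user_d"), ("Liberal", "user_e"),
]


def top_author_share(posts, top_n=1):
    """Percent of each subreddit's political posts written by its top_n authors."""
    by_sub = defaultdict(Counter)
    for subreddit, author in posts:
        by_sub[subreddit][author] += 1

    shares = {}
    for subreddit, author_counts in by_sub.items():
        total = sum(author_counts.values())
        top_total = sum(n for _, n in author_counts.most_common(top_n))
        shares[subreddit] = 100 * top_total / total
    return shares


shares = top_author_share(political_posts, top_n=1)
print(shares)
```

A high share for a small `top_n` would suggest that a handful of accounts drive most of the political content in that subreddit.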

NLP

Idea 4: Comparison Between Reddit Sentiment and Presidential Job Approval Rates

Business Goal: Analyze how sentiment trends on political subreddits focused on Trump and Biden correlate with broader public opinion as measured by Presidential Job Approval Rates. This analysis aims to understand whether shifts in online sentiment, as expressed in subreddit discussions, reflect changes in real-world political perceptions or highlight discrepancies between online and offline public opinion over time.

Technical Proposal: Apply sentiment analysis to submissions from the selected political subreddits. Aggregate positive sentiment by month for Trump-related and Biden-related submissions. Compare these sentiment trends with monthly Presidential Job Approval Rates to see how subreddit sentiment aligns with public political views. Conduct a statistical test to determine whether the two series are correlated.
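One way the monthly aggregation and correlation step might look is sketched below. The sentiment labels and approval figures are made-up placeholders (not real polling data), and Pearson correlation is used as one possible measure; the proposal does not name a specific test.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (month, sentiment_label) pairs, as a sentiment model might produce.
labeled = (
    [("2023-06", "positive")] * 1 + [("2023-06", "negative")] * 3
    + [("2023-07", "positive")] * 2 + [("2023-07", "negative")] * 2
    + [("2023-08", "positive")] * 3 + [("2023-08", "negative")] * 1
)

# Made-up approval percentages for illustration only, not real polling data.
approval = {"2023-06": 40.0, "2023-07": 42.0, "2023-08": 41.0}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)


# Count total and positive submissions per month.
totals, positives = defaultdict(int), defaultdict(int)
for month, label in labeled:
    totals[month] += 1
    if label == "positive":
        positives[month] += 1

months = sorted(totals)
positive_share = [positives[m] / totals[m] for m in months]
approval_series = [approval[m] for m in months]

r = pearson(positive_share, approval_series)
print(months, positive_share, round(r, 3))
```

With only a handful of monthly observations, any correlation estimate is very noisy, so the coefficient should be read together with a significance test on the full study period.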

Idea 6: Impact of Dominant Terms on Shaping Political Discussions on Reddit

Business Goal: Identify the most impactful terms in shaping political discussions on Reddit by quantifying their importance across subreddits. This analysis seeks to uncover how specific terms drive or influence political conversations in various communities.

Technical Proposal: Apply Term Frequency-Inverse Document Frequency (TF-IDF) analysis to identify the most significant terms from a subset of key words derived from Count Vectorizer and LDA outputs. This technique will calculate the relative importance of words within specific subreddits, factoring in their prevalence across the entire Reddit dataset. Combine TF-IDF results with sentiment analysis and topic modeling to contextualize how these terms contribute to shaping subreddit-specific political narratives.
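To make the TF-IDF idea concrete, here is a small sketch that treats each subreddit as one document. The token lists are invented, and the formula used (relative term frequency times the natural log of N over document frequency) is one common unsmoothed variant; libraries often apply smoothed versions, so actual scores may differ.

```python
from collections import Counter
from math import log

# Toy corpus: one token list per subreddit (made-up text).
docs = {
    "Conservative": ["border", "border", "election", "tax"],
    "democrats": ["election", "vote", "vote", "healthcare"],
    "Libertarian": ["tax", "tax", "liberty", "election"],
}


def tf_idf(docs):
    """Score each term per document as (count / doc length) * ln(N / doc frequency)."""
    n_docs = len(docs)
    # Document frequency: number of subreddits in which each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(docs)

# Highest-scoring term for each subreddit.
top_terms = {name: max(s, key=s.get) for name, s in scores.items()}
print(top_terms)
```

A term such as "election" that appears in every subreddit receives a score of zero, which is how TF-IDF separates distinctive vocabulary from terms shared by all communities.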

Idea 7: Identifying the Most Extreme Subreddits in Political Discourse

Business Goal: Determine which political subreddits foster the most extreme or distinct discussions by analyzing sentiment, frequently used terms, and dominant topics. The goal is to identify the communities that generate the most distinct and possibly divisive political narratives.

Technical Proposal: Use the results from sentiment analysis, Count Vectorization, and LDA to identify various trends in sentiments, frequently used words, and topic-specific terms within the submissions. Compare these results across subreddits by quantifying differences in sentiment distributions, lexical diversity, and topic extremity.

ML

Idea 8: Subreddit Prediction from Submission Text

Business goal: Predict the most likely subreddit for a given post based on its text content to improve content recommendations and classification.

Technical proposal: Train a multi-class classifier on text data, using subreddit names as labels and embeddings or TF-IDF vectors as features. Train across the political subreddits and evaluate the model’s performance on unseen data using a confusion matrix and classification metrics. Analyze which words or phrases contribute most to subreddit prediction.
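The evaluation step can be illustrated without committing to a particular model. The sketch below builds a confusion matrix, accuracy, and per-class precision and recall from hypothetical true and predicted labels; the labels are invented and no classifier is actually trained here.

```python
from collections import Counter

labels = ["Conservative", "democrats", "Libertarian"]

# Hypothetical held-out labels and model predictions (made-up values).
y_true = ["Conservative", "Conservative", "democrats", "democrats", "Libertarian", "Libertarian"]
y_pred = ["Conservative", "Libertarian", "democrats", "democrats", "Libertarian", "Conservative"]

pair_counts = Counter(zip(y_true, y_pred))

# Rows are true subreddits, columns are predicted subreddits.
confusion = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Fraction of posts whose predicted subreddit matches the true one.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / len(y_true)


def precision_recall(label):
    """Precision and recall for one subreddit, read off the confusion matrix."""
    i = labels.index(label)
    tp = confusion[i][i]
    predicted = sum(row[i] for row in confusion)  # column total
    actual = sum(confusion[i])                    # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


for row_label, row in zip(labels, confusion):
    print(row_label, row)
print("accuracy:", round(accuracy, 3))
```

Off-diagonal cells show which pairs of subreddits the model confuses, which in turn hints at how similar their language is.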

Idea 9: Predicting Popularity of Posts

Business goal: Identify which factors contribute to a post’s popularity (score) to better understand what drives user engagement in political discussions.

Technical proposal: Use post score, a good proxy for popularity, as the target variable. Use post text, post title, comment count, length, and subreddit name as features. Normalize values for subreddit size, since posts in larger subreddits tend to receive higher scores. Train a regression model to predict score using data from political subreddits. Evaluate the model’s performance based on \(RMSE\) and \(R^2\), and analyze feature importance (if possible) to understand which factors most influence post popularity.
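For reference, the two evaluation metrics can be computed as in the sketch below. The actual and predicted scores are made-up numbers, not model output.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up post scores and predictions for illustration.
actual_scores = [10, 20, 30, 40]
predicted_scores = [12, 18, 33, 37]

print(rmse(actual_scores, predicted_scores))
print(r_squared(actual_scores, predicted_scores))
```

RMSE is expressed in the same units as the score, while \(R^2\) is unitless, so the two are best reported together.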

Idea 10: Predicting Political Leaning of Comments and Posts

Business goal: Automatically classify comments and posts as left-leaning, right-leaning, or neutral to understand the distribution of political perspectives within and across subreddits.

Technical proposal: Use a pre-trained model to predict the political leanings of comments and posts. Identify a pre-trained model that is effective at classifying text by political leaning. Apply the model to comments and posts in political subreddits, then analyze the distribution of political leanings across subreddits. Compare the distribution of political leanings in each subreddit with its stated political affiliation to assess alignment.

Idea 11: Predicting Number of Comments

Business goal: Develop a model to predict whether a post will receive a small, moderate, or large number of comments.

Technical proposal: Create a categorical variable from the number of comments linked to a submission, using percentiles to label each post as having a small, moderate, or large number of comments. Train a machine learning model to predict this category, apply the model to the submissions in the dataset, and analyze the results. Identify the most important features for predicting the comment category, and compare results across subreddits and other features.
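The percentile-based target variable might be built as in the following sketch. Splitting at terciles (three equal-sized groups) is an assumption, since the proposal does not state which percentiles are used, and the comment counts are invented.

```python
from collections import Counter
from statistics import quantiles

# Made-up comment counts for a handful of submissions.
comment_counts = [0, 1, 2, 3, 5, 8, 13, 21, 34]

# Two cut points that split the data into three groups (terciles are an assumption).
low_cut, high_cut = quantiles(comment_counts, n=3)


def comment_category(n_comments):
    """Label a submission as having a small, moderate, or large number of comments."""
    if n_comments <= low_cut:
        return "small"
    if n_comments <= high_cut:
        return "moderate"
    return "large"


categories = [comment_category(n) for n in comment_counts]
print(Counter(categories))
```

Computing the cut points on the training split only, and reusing them for the test split, would avoid leaking information from held-out posts into the labels.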