The Language of Politics

Charting the Vocabulary of Reddit's Political Communities

I've always found Reddit really interesting from a political standpoint. Whatever the political ideology, a community (or 'subreddit') exists where like-minded individuals discuss current events through that ideological lens. I noticed that these subreddits discuss the same problems but use very different language.

To quantify these differences, I use PRAW, a Python wrapper for Reddit's open API, to scrape the most popular threads from three politically engaged subreddits. I use Python to process the comments, Redis to push the words into a PostgreSQL database, and Postgres to form the data objects on this page. This process is automated and runs every 15 minutes on an AWS EC2 instance. To ensure that the same data is not processed twice, I store the unique ID of each processed comment and then exclude these IDs from the crawler. I only include first- or second-level comments with an overall positive score. You may notice that words such as 'the', 'and', and 'then' are not present in this data set. I consulted Wikipedia's list of most common words and excluded roughly the 150 most common ones.
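The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Comment` record, its field names, the `count_words` helper, and the short `STOP_WORDS` set are all hypothetical stand-ins (the real stop list has roughly 150 entries, and the real comments come from PRAW). It assumes depth 0 means a first-level comment and depth 1 a second-level one.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Tiny stand-in for the ~150 most common English words (assumption).
STOP_WORDS = {"the", "and", "then", "a", "is", "of", "to", "in", "on"}


@dataclass
class Comment:
    """Hypothetical minimal comment record, not PRAW's own Comment class."""
    id: str
    body: str
    score: int
    depth: int  # 0 = first-level comment, 1 = second-level comment


def count_words(comments, seen_ids):
    """Count non-stop words in eligible comments; updates seen_ids in place."""
    counts = Counter()
    for c in comments:
        if c.id in seen_ids:
            continue  # already processed in an earlier run
        if c.depth > 1 or c.score <= 0:
            continue  # keep only first/second-level comments with positive score
        seen_ids.add(c.id)
        # Lowercase, then pull out words (allowing one inner apostrophe).
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", c.body.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


comments = [
    Comment("c1", "The tax bill is a disaster", 12, 0),
    Comment("c2", "Tax cuts and then more tax cuts", 3, 1),
    Comment("c3", "Buried reply about the tax bill", 5, 2),   # too deep
    Comment("c4", "Downvoted take on the bill", -4, 0),       # negative score
]
seen = set()
counts = count_words(comments, seen)
print(counts.most_common(2))
```

Because `seen` now holds the IDs of the two processed comments, calling `count_words(comments, seen)` again returns an empty counter, which is the behavior the 15-minute schedule relies on. In this sketch, comments that fail the depth or score filter are not marked as seen, so a comment whose score later turns positive would still be picked up; that is one possible design choice among several.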

If you are interested, feel free to check out the project on GitHub. Two options are available: you may either select one of three pre-rendered datasets or look up the top words for a specific day.

Available datasets:

Select a date: