Raising the Bar
Implementing Redis Queues To Improve Throughput
In 2018, a year after I wrote my Twitter stream tool, I revisited the project with two goals in mind: better data fidelity and better throughput. The first time around, I didn't do enough data cleansing, and I ended up doing a ton of manual work on the dataset before I could use it. Much of that came down to not encoding the tweets in UTF-8, which made emoji and characters from non-Latin alphabets troublesome. I had also intended to graph tweets over time, but a crash during the third quarter left that chart unfinished.

To improve throughput, I split reading and writing into two separate mechanisms. The first is a listener, which reads tweets from the Twitter API and shoves them into a Redis queue. The second is a writer, which pulls the same data back out of Redis and writes it to either a database or a JSON file.
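The post doesn't include source, but a minimal sketch of the two halves might look like the following, assuming tweepy 3.x for the Twitter API and redis-py for the queue. The queue key, output filename, and credentials are placeholders, not the project's actual names.

```python
# listener.py: read tweets from the Twitter API and push them onto a Redis list.
import redis
import tweepy

TWEET_QUEUE = "tweets"  # placeholder queue key
r = redis.Redis(host="localhost", port=6379)

class QueueListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        # Encode to UTF-8 before queueing, so emoji and non-Latin
        # characters survive the round trip intact.
        r.lpush(TWEET_QUEUE, raw_data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        # Returning False on a 420 disconnects instead of hammering the API.
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholder creds
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
tweepy.Stream(auth, QueueListener()).filter(track=["eagles", "patriots"])
```

The writer side is even simpler: it blocks on the queue and appends each payload to a file (a database insert would slot in at the same point).

```python
# writer.py: pull tweets back out of Redis and append them to a JSON-lines file.
import redis

TWEET_QUEUE = "tweets"  # must match the listener's key
r = redis.Redis(host="localhost", port=6379)

with open("tweets.json", "ab") as out:
    while True:
        item = r.brpop(TWEET_QUEUE, timeout=5)  # blocking pop; None on timeout
        if item is None:
            continue  # queue was empty; keep waiting
        _, payload = item  # brpop returns a (key, value) pair
        out.write(payload + b"\n")
```

Because the listener never touches the disk, a slow write can't make it fall behind the stream; Redis absorbs the bursts.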
I am pleased with the result. Here is a video of the program streaming about 5,000 tweets per minute to a file. The left pane is a Redis monitor, which prints push/pull operations as they happen. The middle and right panes are two instances of the program, streaming tweets that mention the Eagles or the Patriots. Over the course of the Super Bowl, both files grew to about 150 MB, around 4,000,000 tweets in total. All in all, the refactor was a success and I am happy with how it turned out.