Photo by Markus Spiske on Unsplash

In this article, I’ll show you how to use Pushshift to scrape a large amount of Reddit data and build a dataset from it. By “large” I mean between 50,000 and 500,000 items. Queries for more than 500,000 items will be covered in a separate article, because at that size the risk of an out-of-memory error increases, depending on how much memory is available.

Why should you create a dataset from Reddit data? Reddit provides a public forum for communities with similar interests to discuss and exchange ideas, with a large majority of this data being…

Matt Podolak

Software engineer with an interest in data. Reach out to me on LinkedIn.
