In this article, I’m going to show you how to use Pushshift to scrape a large amount of Reddit data and create a dataset. I define “large” as between 50,000 and 500,000 items. Queries for more than 500,000 items will be covered in a separate article, because at that size the risk of an out-of-memory error rises, depending on how much memory is available.

Why should you create a dataset from Reddit data? Reddit provides a public forum for communities with similar interests to discuss and exchange ideas, with a large majority of this data being…

