
I need to analyse all the comments in a subreddit (e.g. r/dogs, from 2015 onward) using the Python Reddit API Wrapper (PRAW). I'd like the data to be stored as JSON.

I have found a way to get all of the top, new, hot, etc. submissions as a JSON using PRAW:

new_submissions = reddit.subreddit("dogs").new()
top_submissions = reddit.subreddit("dogs").top("all")

However, these are only the submissions for the subreddit, and do not include comments. How can I get the comments for these posts as well?

jack.py
ml_learner

1 Answer


You can use the Python Pushshift.io API Wrapper (PSAW) to get all the most recent submissions and comments from a specific subreddit, and can even do more complex queries (such as searching for specific text inside a comment). The docs are available here.

For example, you can use the search_submissions() function to get up to 1000 submissions posted to r/dogs since the start of 2015:

import datetime as dt
import praw
from psaw import PushshiftAPI

r = praw.Reddit(...)
api = PushshiftAPI(r)

start_epoch = int(dt.datetime(2015, 1, 1).timestamp())  # Could be any date

submissions_generator = api.search_submissions(after=start_epoch, subreddit='dogs', limit=1000) # Returns a generator object
submissions = list(submissions_generator) # You can then use this, store it in mongoDB, etc.

Alternatively, to get up to 1000 comments posted to r/dogs since the start of 2015, you can use the search_comments() function:

start_epoch = int(dt.datetime(2015, 1, 1).timestamp())  # Could be any date

comments_generator = api.search_comments(after=start_epoch, subreddit='dogs', limit=1000) # Returns a generator object
comments = list(comments_generator)

Because the PushshiftAPI here is constructed with a PRAW Reddit instance, PSAW returns PRAW objects for the submissions and comments, which may be handy.
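
Since the question asks for JSON, here is a minimal sketch of writing the returned comments to a JSON file. It assumes each item exposes the usual PRAW Comment attributes (id, body, score, created_utc, author, link_id, parent_id). The helper names comment_to_dict and dump_comments are made up for illustration and are not part of PRAW or PSAW.

```python
import json


def comment_to_dict(comment):
    """Copy the fields of interest from a PRAW-style comment into a plain dict."""
    author = getattr(comment, "author", None)
    return {
        "id": comment.id,
        "body": comment.body,
        "score": comment.score,
        "created_utc": comment.created_utc,
        # author may be None (e.g. deleted account); otherwise use its string form
        "author": str(author) if author is not None else None,
        "link_id": comment.link_id,      # fullname of the submission
        "parent_id": comment.parent_id,  # fullname of the parent comment or submission
    }


def dump_comments(comments, path):
    """Write an iterable of comments to `path` as a JSON array; return the count."""
    records = [comment_to_dict(c) for c in comments]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    return len(records)
```

With the list built above, a call such as dump_comments(comments, "dogs_comments.json") would write one record per comment. The file name is only an example.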

jack.py
  • Thank you for your answer. I will try this out with `limit=None`, as I want the whole dataset and not just the top 1000 posts. – ml_learner Nov 23 '21 at 14:19
  • I’m not sure if `limit=None` would work, but you could try leaving out the limit altogether. If that doesn’t work, you can work around it: get the first, say, 1000 posts, then the first 2000 and remove the top 1000 posts from them. Continue this process and combine all the posts into a single list. – jack.py Nov 24 '21 at 13:11
  • @jack.py Thanks for your post. How do I get the top N posts by section, like _HOT_, _NEW_, etc.? – Volatil3 Sep 16 '22 at 06:44
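
The batching workaround suggested in the comments above could look roughly like the sketch below. It is only an illustration: collect_in_batches and fetch_first are hypothetical names, and fetch_first(limit) is assumed to return the first `limit` items in a stable order, each with an `id` attribute.

```python
def collect_in_batches(fetch_first, batch_size=1000, max_items=None):
    """Fetch the first N items with a growing N, keeping only items not seen before.

    fetch_first(limit) is a caller-supplied (hypothetical) callable that returns
    an iterable of the first `limit` items; each item must have an `id` attribute.
    """
    seen_ids = set()
    collected = []
    limit = batch_size
    while True:
        returned = 0  # how many items this request produced
        added = 0     # how many of them were new
        for item in fetch_first(limit):
            returned += 1
            if item.id not in seen_ids:
                seen_ids.add(item.id)
                collected.append(item)
                added += 1
        enough = max_items is not None and len(collected) >= max_items
        # Stop when nothing new arrived, the source ran dry, or we have enough.
        if added == 0 or returned < limit or enough:
            break
        limit += batch_size
    return collected if max_items is None else collected[:max_items]
```

For the answer's example, fetch_first might be `lambda n: api.search_comments(after=start_epoch, subreddit='dogs', limit=n)`. Each round downloads the earlier items again, so this gets slow for large subreddits and is best treated as a fallback.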