
I want to collect the comments on Reddit, and I use PRAW to get the ID of a submission, like a2rp5i. For example, I have already collected a set of IDs:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

docArr = ['a14bfr', '9zlro3', 'a2pz6f', 'a2n60r', 'a0dlj3']

my_url = "https://old.reddit.com/r/Games/comments/a0dlj3/"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content_containers = page_soup.findAll("div", {"class": "md"})
timestamp_containers = page_soup.findAll("p", {"class": "tagline"})
time = timestamp_containers[0].time.get('datetime')

I want to use time as the filename and save the content as a .txt file:

outfile = open('%s.txt' % time, "w")
for content_container in content_containers:
    if content_container.text == "(self.games)":  # skip the "(self.games)" tagline if it appears
        continue
    data = content_container.text.encode('utf8').decode('cp950', 'ignore')
    outfile.write(data)
outfile.close()

This attempt works fine for saving a single URL, but I want to save every ID in docArr in the same run:

url_test = "https://old.reddit.com/r/Games/comments/{}/"
for i in set(docArr):
    url = url_test.format(i)

This gets me the right URL, but how do I save the time and content_containers of all of the URLs in docArr at once?

wayne64001
  • let me know if my answer is missing something – ewwink Dec 07 '18 at 07:52
  • If you're using PRAW there's no need for scraping reddit. That adds extra work (to program, and for bandwidth), which adds extra potential bugs. The object PRAW returns has a `created_utc` that you can use to create a datetime instance: `datetime.utcfromtimestamp(comment.created_utc)` (see the sketch after these comments). – ninMonkey Aug 21 '19 at 10:42
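
A minimal sketch of that PRAW-only approach (assuming registered script-app credentials; the client_id/client_secret values and the comment-collector user agent below are placeholders):

import praw
from datetime import datetime

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="comment-collector")

for doc_id in docArr:
    submission = reddit.submission(id=doc_id)
    submission.comments.replace_more(limit=0)  # resolve "load more comments" stubs
    created = datetime.utcfromtimestamp(submission.created_utc)
    with open('%s.txt' % created.isoformat(), 'w', encoding='utf-8') as outfile:
        for comment in submission.comments.list():
            outfile.write(comment.body + '\n')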

1 Answer

You just need to indent your current code so the fetching and saving happen inside the loop:

for i in docArr:
    url = url_test.format(i)
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    content_containers = page_soup.findAll("div", {"class": "md"})
    timestamp_containers = page_soup.findAll("p", {"class": "tagline"})
    time = timestamp_containers[0].time.get('datetime')
    outfile = open('%s.txt' % time, "w")
    for content_container in content_containers:
        if content_container.text == "(self.games)":
            continue
        data = content_container.text.encode('utf8').decode('cp950', 'ignore')
        outfile.write(data)
    outfile.close()
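
Using a `with open('%s.txt' % time, 'w') as outfile:` block instead would close each file automatically, even if a request fails partway through the loop.
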
ewwink
  • Note the encoding is broken. `data = content_container.text.encode('utf8').decode('cp950', 'ignore')` takes the encoding-agnostic (unicode) string, converts it to bytes using the UTF-8 encoding, and then reads those UTF-8 bytes as cp950. If you want cp950 output, encode the unicode string directly: `unicode_string.encode('cp950')` (see the sketch below). – ninMonkey Aug 21 '19 at 10:38
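
A minimal sketch of the fix that comment describes, assuming cp950 output is actually what you want (`text` stands in for `content_container.text`; passing `encoding=` to `open()` is equivalent to encoding the string yourself on write):

text = content_container.text  # an encoding-agnostic (unicode) str

# broken: UTF-8 bytes reinterpreted as cp950, silently dropping characters
# data = text.encode('utf8').decode('cp950', 'ignore')

# correct: encode the unicode string once, on the way out
with open('%s.txt' % time, 'w', encoding='cp950', errors='ignore') as outfile:
    outfile.write(text)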