
I want to collect the comments on Reddit, and I use PRAW to get the ID of a submission, like a2rp5i. For example, I have already collected a set of IDs:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

docArr = ['a14bfr', '9zlro3', 'a2pz6f', 'a2n60r', 'a0dlj3']

my_url = "https://old.reddit.com/r/Games/comments/a0dlj3/"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content_containers = page_soup.findAll("div", {"class": "md"})
timestamp_containers = page_soup.findAll("p", {"class": "tagline"})
time = timestamp_containers[0].time.get('datetime')

I want to use time as the filename and save the content as a .txt file:

outfile = open('%s.txt' % time, "w")
for content_container in content_containers:
    if content_container.text == "(self.games)":  # skip the "(self.games)" tagline if it appears
        continue
    data = content_container.text.encode('utf8').decode('cp950', 'ignore')
    outfile.write(data)
outfile.close()

This attempt works fine for saving a single URL, but I want to save every ID in docArr in the same run:

url_test = "https://old.reddit.com/r/Games/comments/{}/"
for i in set(docArr):
    url = url_test.format(i)

This gets me the right URL, but how do I save the time and content_containers of all of the URLs in docArr at once?

wayne64001
  • let me know if my answer is missing something – ewwink Dec 07 '18 at 07:52
  • If you're using PRAW there's no need for scraping reddit. That adds extra work (to program, and for bandwidth), which adds extra potential bugs. The object PRAW returns has a `created_utc` that you can use to create a datetime instance: `datetime.utcfromtimestamp(comment.created_utc)` (see the sketch after these comments). – ninMonkey Aug 21 '19 at 10:42
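
A minimal sketch of that PRAW-only approach (assuming registered script-app credentials; the client_id/client_secret values and the comment-collector user agent below are placeholders):

import praw
from datetime import datetime

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="comment-collector")

for doc_id in docArr:
    submission = reddit.submission(id=doc_id)
    submission.comments.replace_more(limit=0)  # resolve "load more comments" stubs
    created = datetime.utcfromtimestamp(submission.created_utc)
    with open('%s.txt' % created.isoformat(), 'w', encoding='utf-8') as outfile:
        for comment in submission.comments.list():
            outfile.write(comment.body + '\n')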

1 Answer

You just need to indent your current code so the fetching and saving happen inside the loop:

for i in docArr:
    url = url_test.format(i)
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    content_containers = page_soup.findAll("div", {"class": "md"})
    timestamp_containers = page_soup.findAll("p", {"class": "tagline"})
    time = timestamp_containers[0].time.get('datetime')
    outfile = open('%s.txt' % time, "w")
    for content_container in content_containers:
        if content_container.text == "(self.games)":
            continue
        data = content_container.text.encode('utf8').decode('cp950', 'ignore')
        outfile.write(data)
    outfile.close()
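
Using a `with open('%s.txt' % time, 'w') as outfile:` block instead would close each file automatically, even if a request fails partway through the loop.
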
ewwink
  • Note the encoding is broken. `data = content_container.text.encode('utf8').decode('cp950', 'ignore')` takes the encoding-agnostic (unicode) string, converts it to bytes using the UTF-8 encoding, and then reads those UTF-8 bytes as cp950. If you want cp950 output, encode the unicode string directly: `unicode_string.encode('cp950')` (see the sketch below). – ninMonkey Aug 21 '19 at 10:38
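
A minimal sketch of the fix that comment describes, assuming cp950 output is actually what you want (`text` stands in for `content_container.text`; passing `encoding=` to `open()` is equivalent to encoding the string yourself on write):

text = content_container.text  # an encoding-agnostic (unicode) str

# broken: UTF-8 bytes reinterpreted as cp950, silently dropping characters
# data = text.encode('utf8').decode('cp950', 'ignore')

# correct: encode the unicode string once, on the way out
with open('%s.txt' % time, 'w', encoding='cp950', errors='ignore') as outfile:
    outfile.write(text)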