I want to practice parsing values from a website. However, when I parse the comments from Steam, I can only get the first page of comments. How do I crawl all of the comments?

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://steamcommunity.com/games/dota2/announcements/detail/1449457773770927103'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
for t in soup.body.find_all('div', attrs={'class': 'commentthread_comment_text'}):
    print(t.text)
akash karothiya
Jammy
  • Welcome to the merciless world of `scraping`. Either you find a hook into the comments, or you use a webdriver such as `selenium` to simulate a `click` and get the data. – user1767754 Nov 28 '17 at 07:31

1 Answer

If you open your browser's dev console, switch to the Network tab, and then click the next-page button under the comments, you'll see that the page makes a request to the following URL:

https://steamcommunity.com/comment/ClanAnnouncement/render/103582791433224455/1449457773770927103/

EDIT:

In the response body you'll see the following three properties: start, pagesize, total_count. If you keep increasing the start query parameter, you'll be able to fetch all comments:

https://steamcommunity.com/comment/ClanAnnouncement/render/103582791433224455/1449457773770927103/?start=10

https://steamcommunity.com/comment/ClanAnnouncement/render/103582791433224455/1449457773770927103/?start=20
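The paging described above could be sketched roughly as follows. This is only a sketch under assumptions: it assumes the endpoint answers a plain GET with a JSON body that carries the pagesize and total_count properties mentioned above, and the helper names fetch_page and iter_pages are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = ('https://steamcommunity.com/comment/ClanAnnouncement/render/'
            '103582791433224455/1449457773770927103/')


def fetch_page(start):
    """Request one page of the comment thread and decode the body.

    Assumption: the response body is JSON.
    """
    url = BASE_URL + '?' + urllib.parse.urlencode({'start': start})
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def iter_pages(fetch=fetch_page):
    """Yield each decoded page until total_count has been reached."""
    start = 0
    while True:
        page = fetch(start)
        yield page
        pagesize = int(page.get('pagesize', 0))
        total_count = int(page.get('total_count', 0))
        start += pagesize
        # Stop after the last page, or if the expected properties are
        # missing (this avoids looping forever on an unexpected response).
        if pagesize <= 0 or start >= total_count:
            break
```

Iterating with `for page in iter_pages():` would then request start=0, start=10, start=20 and so on, as long as pagesize is 10. Passing the fetch function as a parameter makes it possible to try the loop with canned data before hitting the site.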

fodma1
  • I don't get it. The URL you provided only contains the comments from the first page. I still can't parse all of the comments. – Jammy Nov 28 '17 at 07:44
  • Thank you, it works. Can you teach me how to find the above URL, please? I want to learn how to find it so that I can solve this by myself next time. Another question: are the comments stored in a JSON file? Can I just load the JSON, then find the values (comments) I want? Thank you so much!! – Jammy Nov 28 '17 at 09:03
  • Finding this URL is guesswork. I was pretty sure those comments are loaded via AJAX, so I just inspected the requests the page made. I'd try to look for those numerical IDs in the page source. The second one is the same as the announcement ID in your URL. The first ID (starting with 103) might be the ID of dota2. Unfortunately, there's no way to extract the data as JSON that I know of, but I'm not familiar with their API. If you think this answer helped you get closer to your solution, please accept and upvote it. – fodma1 Nov 28 '17 at 09:23
  • One more question: how do I get the comments from the content at that URL? The source returned by the URL looks a little different from regular HTML. There are a lot of "\r\n\t" sequences that get in the way of parsing the data. Do I need to use a regular expression to replace them first? Thank you! – Jammy Nov 28 '17 at 17:57
  • Well, I just tried to parse it with BS. It's an ugly piece of HTML for sure, but it looks like they don't care about that at Steam. When you try to parse it, you'll find loads of empty-string children or `\n` children. One thing you can do is minify the content with BS: https://stackoverflow.com/a/24435820/2419215 I tried it with BS and it looks somewhat better, although the line endings stay in place. – fodma1 Nov 28 '17 at 19:54
  • Thank you so much. I sent a GET request and loaded the response as JSON, and then successfully got the values I want without '\r\n\t'. Thanks! – Jammy Dec 01 '17 at 19:57
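The approach from the last comment might look something like the sketch below. Decoding the body with json turns the escaped "\r\n\t" sequences into ordinary whitespace, which can then be collapsed without a regular expression. The key name comments_html is an assumption about the response and is not confirmed in this thread; the class CommentTextParser and the function comments_from_response are made up for illustration. The sketch uses only the standard library; the BeautifulSoup loop from the question could be applied to the same decoded HTML fragment instead.

```python
import json
from html.parser import HTMLParser


class CommentTextParser(HTMLParser):
    """Collect the text inside <div class="commentthread_comment_text">."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._depth = 0    # greater than 0 while inside a comment div
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'br' and self._depth:
            self._chunks.append(' ')   # keep words on both sides apart
        if tag != 'div':
            return
        classes = (dict(attrs).get('class') or '').split()
        if self._depth or 'commentthread_comment_text' in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace, including \r, \n and \t.
                text = ' '.join(''.join(self._chunks).split())
                self.comments.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def comments_from_response(body):
    """Decode a JSON response body and return the cleaned comment texts."""
    payload = json.loads(body)
    parser = CommentTextParser()
    # Assumption: the HTML fragment lives under the key 'comments_html'.
    parser.feed(payload.get('comments_html', ''))
    parser.close()
    return parser.comments
```

Here body is the raw text returned by the comment URL; if the HTML turns out to be stored under a different key, only the lookup in comments_from_response needs to change.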