This is a follow-up to a question I posted a week ago (getting text from html using BeautifulSoup). It seems that most of the data I want to extract is filled in through data-bind attributes and is not present in what soup.findAll returns. For example, taking this link: kaggle/user/results, I am trying to get the names of all the competitions the user participated in. I am using the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.kaggle.com/titericz/results'
sourceCode = requests.get(url)
plainText = sourceCode.text
soup = BeautifulSoup(plainText, 'html.parser')
for link in soup.findAll('tr'):
    print(link)

So I take the first competition, but in the output the values for the competition name, the position in the competition, the total number of competitors, etc. are missing, although they are there in the HTML. I tried to follow the same procedure as in the answer to the question linked above (using re.compile and pattern.search), but I could not get it to work. Is there a way to accomplish this using BeautifulSoup? I couldn't find any similar issue on the web.
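To illustrate the symptom described above, here is a minimal sketch that runs offline. It assumes the server sends table cells that carry only a data-bind attribute and that the text is filled in later by JavaScript. The HTML fragment, the binding names and the helper names (`SAMPLE_HTML`, `CellCollector`, `collect_cells`) are hypothetical, and the standard library's `html.parser` stands in for BeautifulSoup so that nothing has to be installed. Any parser that reads the raw response should see the same empty cells.

```python
from html.parser import HTMLParser

# Hypothetical fragment: what a server might send before any JavaScript runs.
# The cells carry data-bind attributes but contain no text.
SAMPLE_HTML = """
<table>
  <tr>
    <td data-bind="text: competition.title"></td>
    <td data-bind="text: rank"></td>
    <td data-bind="text: teamCount"></td>
  </tr>
</table>
"""


class CellCollector(HTMLParser):
    """Record each <td>'s data-bind attribute and the text inside the cell."""

    def __init__(self):
        super().__init__()
        self.cells = []       # list of [binding, text] pairs
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append([dict(attrs).get("data-bind"), ""])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text that sits inside a <td> is attached to a cell.
        if self._in_td:
            self.cells[-1][1] += data.strip()


def collect_cells(html):
    parser = CellCollector()
    parser.feed(html)
    return parser.cells


for binding, text in collect_cells(SAMPLE_HTML):
    print(repr(binding), "->", repr(text))
```

Every cell comes back with its binding expression but an empty string as text, because the values only exist after a browser has executed the page's scripts.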

Mpizos Dimitris
  • Wouldn't it be better to just use the underlying GET request? I'm guessing this is it: https://www.kaggle.com/scripts/all/0?userId=54836&sortBy=votes – Alan Francis Dec 21 '15 at 12:31
  • Thanks for the answer. Can you be more clear? What procedure should I follow? – Mpizos Dimitris Dec 21 '15 at 12:34
  • You can get the data directly as JSON from the URL https://www.kaggle.com/knockout/profiles/54836/results and parse it: [json parsing in python](https://docs.python.org/2/library/json.html). You wouldn't need BeautifulSoup in this case. – Alan Francis Dec 21 '15 at 12:38
  • Hey @AlanFrancis, that's interesting. How did you find out that the underlying GET request was that URL? How can I obtain it? – aDoN Dec 21 '15 at 13:01
  • You can use Firefox with Firebug to see all the HTTP requests that a web page makes. I just went through a few to find which request had the data Mpizos wanted. – Alan Francis Dec 21 '15 at 13:16

1 Answer

You can parse the response of the underlying GET request, which returns a JSON string.

Here's a small script to get you started.

import requests
import json

# Request the JSON endpoint that the results page itself loads its data from.
jsonResponse = requests.get("https://www.kaggle.com/knockout/profiles/54836/results")
data = json.loads(jsonResponse.text)
print(data)

# Each entry describes one competition the user took part in.
for eachData in data:
    print("competition name:", eachData["competition"]["title"])
    print("Rank:", eachData["rank"])
    print("competitors count:", eachData["teamCount"])

The output will have the following format:

 competition name: Digit Recognizer 
 Rank: None
 competitors count: 933
 competition name: The Allen AI Science Challenge 
 Rank: 110 
 competitors count: 486
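As the sample output shows, the rank can be None for some competitions. The sketch below shows one way to handle that while parsing, and it runs offline on a small inline sample instead of calling the endpoint. The sample mirrors only the three fields used above (`competition.title`, `rank`, `teamCount`) with the values from the output. The rest of the real response layout is not known here, and `SAMPLE_JSON`, `summarize` and the "unranked" label are illustrative names, not part of any Kaggle API.

```python
import json

# Inline stand-in for the endpoint's response; only the fields used in the
# script above are included, so the real payload may contain more keys.
SAMPLE_JSON = """
[
  {"competition": {"title": "Digit Recognizer"}, "rank": null, "teamCount": 933},
  {"competition": {"title": "The Allen AI Science Challenge"}, "rank": 110, "teamCount": 486}
]
"""


def summarize(entries):
    """Return (title, rank_text, team_count) tuples, tolerating missing values."""
    rows = []
    for entry in entries:
        title = entry.get("competition", {}).get("title", "unknown").strip()
        rank = entry.get("rank")
        # JSON null becomes None in Python; show it as a readable label.
        rank_text = "unranked" if rank is None else str(rank)
        rows.append((title, rank_text, entry.get("teamCount")))
    return rows


rows = summarize(json.loads(SAMPLE_JSON))
for title, rank_text, team_count in rows:
    print(f"{title}: rank {rank_text} of {team_count}")
```

Using `dict.get` instead of direct indexing keeps the loop from raising a KeyError if an entry lacks one of the fields, which may or may not happen with the live data.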
Alan Francis