Python: Getting text from HTML using BeautifulSoup

Question

I am trying to extract the ranking text from this link (example: the Kaggle user ranked no. 1).

I am using the following code:

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText)
    for item_name in soup.findAll('h4',{'data-bind':"text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'   
get_single_item_data(item_url)

The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:

[<h4 data-bind="text: rankingText"></h4>]

but when I inspect the page in the browser, the element looks like this:

<h4 data-bind="text: rankingText">1st</h4>

It's clear that the text is missing from the result. How can I get around that?

Edit: When I print the soup variable in the terminal, I can see that this value exists, so there should be a way to access it through soup.

Edit 2: I tried, unsuccessfully, to use the most-voted answer from this Stack Overflow question. There could be a solution along those lines.

When I inspect the source of that page it looks like `
` ... so the result seems correct. — l'L'l, Dec 17 '15 at 14:15
I have edited my question. You can see in the screenshot the value exists. — Mpizos Dimitris, Dec 17 '15 at 14:19
The url in your code is different from the one in your linked example; this was confusing. I would suggest changing the one in your code to match the example or vice versa. — l'L'l, Dec 17 '15 at 14:38

score 4 · Accepted Answer · answered Dec 17 '15 at 15:28

If you aren't going to use browser automation through Selenium, as @Ali suggested, you have to parse the JavaScript that contains the desired information. There are several ways to do this. Here is working code that locates the script with a regular expression, extracts the profile object, loads it with json into a Python dictionary, and prints the ranking:

import re
import json

from bs4 import BeautifulSoup
import requests


response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
# "string" is the current name of the argument that older bs4 releases called "text".
script = soup.find("script", string=pattern)

profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])

Prints:

1 1st
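
The greedy `{.*}` in the pattern above runs to the last `},` in the script, so it depends on how the page happens to be laid out. One possible alternative, sketched here as an assumption rather than something the answer used, is to locate the `profile:` key and let `json.JSONDecoder.raw_decode` read exactly one JSON object from that position. The `sample_script` string, including the `Kaggle.init` wrapper, is made up for illustration and is not the real Kaggle page.

```python
import json
import re

# Hypothetical stand-in for the text of the page's <script> element.
sample_script = """
var app = Kaggle.init({
    profile: {"ranking": 1, "rankingText": "1st", "tiers": {"name": "master"}},
    other: {"unrelated": true},
});
"""


def extract_profile(script_text):
    """Return the object that follows 'profile:' as a dict, or None if absent."""
    # The lookahead leaves match.end() pointing at the opening brace.
    match = re.search(r"profile:\s*(?=\{)", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so no closing delimiter has to be guessed.
    profile, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return profile


profile = extract_profile(sample_script)
print(profile["ranking"], profile["rankingText"])
```

With real page content, `script_text` would be the string of the script element found by BeautifulSoup, assuming the embedded object is valid JSON as it was in the answer above.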

Can I ask what is the purpose of `re.MULTILINE | re.DOTALL` in the `re.compile` function? — Mpizos Dimitris, Dec 21 '15 at 10:20
@MpizosDimitris these are just regular expression flags: [`MULTILINE`](https://docs.python.org/2/library/re.html#re.MULTILINE) makes `^` and `$` match at the start and end of every line, and [`DOTALL`](https://docs.python.org/2/library/re.html#re.DOTALL) lets `.` match newline characters as well. — alecxe, Dec 21 '15 at 12:54
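
A small sketch of what the two flags change, using a made-up multi-line string rather than the Kaggle page. Since the answer's pattern contains no `^` or `$`, `DOTALL` is the flag that matters there.

```python
import re

# Made-up text: an object literal spread over several lines.
text = 'first line\nprofile: {\n  "ranking": 1\n}\nlast line'

# Without DOTALL, '.' stops at a newline, so braces on different
# lines cannot be bridged and there is no match.
no_flag = re.search(r"\{.*\}", text)

# With DOTALL, '.' also matches '\n', so the match spans several lines.
with_dotall = re.search(r"\{.*\}", text, re.DOTALL).group()

# MULTILINE only changes the anchors: '^' and '$' match at each line
# boundary instead of only at the start and end of the whole string.
anchors_default = re.findall(r"^\w+ line$", text)
anchors_multiline = re.findall(r"^\w+ line$", text, re.MULTILINE)

print(no_flag)            # None
print(repr(with_dotall))  # '{\n  "ranking": 1\n}'
print(anchors_default)    # []
print(anchors_multiline)  # ['first line', 'last line']
```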

steinar · Answer 2 · 2015-12-17T14:27:53.810

score 3

The data is bound using JavaScript, as the "data-bind" attribute suggests.

However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:

<script type="text/javascript">
profile: {
...
   "ranking": 96,
   "rankingText": "96th",
   "highestRanking": 3,
   "highestRankingText": "3rd",
...

So you could use that instead.

edited Dec 17 '15 at 14:27

answered Dec 17 '15 at 13:56

steinar


Isn't `highestRanking` different than `rankingText`? – l'L'l Dec 17 '15 at 14:19
Yes thanks, highestRanking is 1 while rankingText is "1st". Seems that it's actually "rankingText" he should be looking for. That's entirely based on a quick glance at the HTML. – steinar Dec 17 '15 at 14:22

Tales Pádua · Answer 3 · 2015-12-17T20:16:11.563

I have solved your problem using regex on the plain text:

import re

import requests


def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Match the 'ranking": <number>' entry embedded in the page's JavaScript.
    pattern = re.compile(r'ranking": [0-9]+')
    name = pattern.search(plainText)
    # The match looks like 'ranking": 1'; the number is the second token.
    ranking = name.group().split()[1]
    print(ranking)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)

This returns only the rank number, but I think it will help you, since from what I can see the rankingText just adds 'st', 'th', etc. to the right of the number.
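
If the text form is needed, the suffix can be rebuilt from the number. This is a minimal sketch that assumes the site follows the usual English ordinal rules; the thread only shows "1st", "3rd" and "96th", so that is a guess, and `ordinal` is a hypothetical helper name.

```python
def ordinal(n):
    """Format a positive integer as an English ordinal, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


print(ordinal(1), ordinal(3), ordinal(96), ordinal(112))  # 1st 3rd 96th 112th
```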

score -1 · Answer 4 · answered Dec 17 '15 at 13:47

This could be because the data is filled in dynamically.

Some JavaScript code fills in this tag after the page loads, so if you fetch the HTML using requests, the tag is still empty:

<h4 data-bind="text: rankingText"></h4>

Take a look at Selenium WebDriver. With it you can load the complete page and have its JavaScript run as it would in a normal browser.
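
A rough sketch of that approach with Selenium's Python bindings, under several assumptions: Selenium 4 with Chrome and a matching driver installed, and a page that still uses the markup shown in the question. `get_rendered_text` is a hypothetical helper name, not part of Selenium.

```python
def get_rendered_text(url, css_selector, timeout=10):
    """Load url in a real browser and return the element's text once it is non-empty."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Poll until the JavaScript data binding has filled in the element;
        # until() returns the first truthy value the callable produces.
        return WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, css_selector).text.strip()
        )
    finally:
        driver.quit()
```

It would be called as `get_rendered_text('https://www.kaggle.com/titericz', 'h4[data-bind="text: rankingText"]')`, and raises Selenium's `TimeoutException` if the element never receives any text within `timeout` seconds.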
