Scraping links from a website using python / beautiful soup for a Kodi addon

Question

The website I'm trying to scrape media links from (for a Kodi addon) doesn't have much in the way of class etc. markers, but each link is in some sort of unique layout.

I have created the basic Kodi addon from another working one, but I'm having trouble getting Python/BeautifulSoup to scrape the links. Other addons rely on class attributes and similar markers, but the website I'm trying to scrape doesn't use many of these.

I've tried all sorts of forums with no luck; most Kodi addon forums are old and not very active. The guides I've looked at seem to go from step 1 to step 1000 very quickly, and the examples they give aren't relevant. I've looked at 30 or so different addons thinking that should help, but I can't work it out.

The media links, episode titles, descriptions and images I'm trying to scrape are listed on www.thisiscriminal.com/episodes

The full addon I've done so far is at Github-repository

I can see in the source they're clearly set out (see code)

I basically just need to be able to parse the website, find the bits below for each episode, populate them as links on the Kodi addon page, and then list the next one underneath. Any help would be greatly appreciated. I've spent about 3 straight days trying to do this and am both very glad and annoyed that I dropped out of that IT degree I started in 2002.

WEBSITE CODE I NEED TO PULL

(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>    

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

CODE

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        #needto check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<- - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240 ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed

The content of this page is dynamically loaded using javascript; you should look into using Selenium. — Jack Fleeting, Jun 07 '19 at 00:25
@JackFleeting Many thanks. I've seen that come up on some posts, is it python based? I'm not sure what other plugins Kodi can use. — leopheard, Jun 07 '19 at 01:04
Selenium definitely works with python, though I think it also has a JS version. There are other products like PhantomJS and spacy. — Jack Fleeting, Jun 07 '19 at 01:11

score 0 · Answer 1 · answered Jun 07 '19 at 02:33

As Jack pointed out, the page response includes JavaScript code that makes AJAX calls. This code is included in the page response but is not executed by requests.

While Selenium would render this for you, I would suggest an alternative.

Navigate to the page with any browser (Chrome shown). Press F12 to open Developer Tools

We are interested in the Network Tab. Select XHR as well. Now that Developer Tools is open, press Ctrl + R to reload the page and log the XHR requests.

You should see the page's XHR requests listed, and you can inspect each one. I think you would be interested in the /episodes endpoint.

This is a structured response, more specifically JSON. To leverage this endpoint you would simply make an identical GET request with requests.

This can be done simply by:

Right-clicking the response
Selecting Copy -> Copy as cURL (Select cURL (Bash) if given the choice)
Paste it in cURL Converter
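
Whatever the converter produces, the request boils down to a plain GET against the endpoint with a couple of query parameters. Here is a minimal sketch of that, assuming Python 3 and using only the standard library so it runs without extra packages. The endpoint URL is the one quoted in the accepted answer; build_episodes_url and fetch_episodes are hypothetical helper names, and the User-Agent header is an assumption rather than something the site is known to require.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint quoted in the accepted answer; whether it is still live is an assumption.
API_URL = "https://thisiscriminal.com/wp-json/criminal/v1/episodes"


def build_episodes_url(posts=10, page=1):
    """Return the endpoint URL with its two query-string parameters filled in."""
    return API_URL + "?" + urlencode({"posts": posts, "page": page})


def fetch_episodes(posts=10, page=1, timeout=10):
    """Send the GET request and decode the JSON body into a Python dict."""
    request = Request(
        build_episodes_url(posts, page),
        # Some servers reject the default urllib user agent, so send a browser-like one.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access, so it is safe to try anywhere.
print(build_episodes_url(posts=5, page=2))
```

Calling fetch_episodes() is the part that actually hits the site, so the addon would get fresh results each time it runs that call.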

QHarr · Accepted Answer · 2019-06-07T04:33:20.123

The good news is that the page loads its content from a WordPress JSON (wp-json) source, and you can issue a simple XHR-style request against it. The other answer covers nicely how to find this.

You can then parse the info you need out of that JSON. The text description is returned as HTML inside the JSON, so you can pass it to bs4 and parse as required. Example below. You can explore the JSON object for the Cecilia episode here, or paste the following into a JSON viewer:

{'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”</p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get $50 off your first purchase of $100 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to $80 off today!</p>\n', 
'image': {'thumb': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}
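
To connect this to the item dict in the question, here is a rough sketch that maps one post from the JSON above onto the 'url', 'title', 'desc' and 'thumbnail' keys used there. post_to_item is a hypothetical helper name, the title format is just one possible choice, and sample_post is a trimmed copy of the Cecilia object rather than a live response.

```python
def post_to_item(post):
    """Map one episode dict from the JSON response to the question's item dict."""
    return {
        'url': post['audioSource'],
        # Title format is an illustrative choice, not something the site dictates.
        'title': u'Episode #{0}: {1}'.format(post['episodeNumber'], post['title']),
        'desc': post['excerpt']['long'],
        'thumbnail': post['image']['medium'],
    }


# Trimmed copy of the Cecilia object shown above (only the keys used here).
sample_post = {
    'title': 'Cecilia',
    'excerpt': {'long': 'When Cecilia Gentili was growing up in Argentina...'},
    'image': {'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png'},
    'episodeNumber': '115',
    'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3',
}

item = post_to_item(sample_post)
print(item['title'])
```

Each item built this way carries the same four fields the question's get_playable_podcast was trying to collect, so the rest of the addon can keep consuming that list.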

The request URL carries a query string, so you can alter the number of items to return. The response also lists the total number of pages, so you know how many requests are needed to return all content.

If you look here

posts=1000&page=1

you can see two parameters you can alter accordingly.
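
As a rough illustration of walking through the pages, here is a sketch that keeps requesting page 1, 2, 3 and so on until a page comes back with no posts. This swaps in a different stopping rule from the total-pages field mentioned above, because that field's name isn't shown here; it assumes an out-of-range page returns an empty 'posts' list, which may not hold for this site. iter_all_posts, fetch_page and max_pages are hypothetical names, and fake_fetch only stands in for a real request.

```python
def iter_all_posts(fetch_page, posts_per_page=100, max_pages=50):
    """Yield every post, one page at a time, until a page comes back empty.

    fetch_page(posts, page) must return the decoded JSON dict for that page.
    max_pages is a safety cap in case the server never returns an empty page.
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(posts_per_page, page)
        posts = data.get('posts', [])
        if not posts:
            break
        for post in posts:
            yield post


def fake_fetch(posts, page):
    """Stand-in for a real request: two pages of canned data, then nothing."""
    canned = {
        1: [{'title': 'Cecilia'}, {'title': 'Philip and Becky'}],
        2: [{'title': 'An older episode'}],
    }
    return {'posts': canned.get(page, [])}


titles = [post['title'] for post in iter_all_posts(fake_fetch, posts_per_page=2)]
print(titles)
```

In the addon, fetch_page would be replaced by a function that requests the endpoint with the given posts and page values and returns the parsed JSON.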

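import requests
from bs4 import BeautifulSoup as bs

# Query the site's JSON endpoint directly instead of scraping the rendered HTML.
r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

for post in r['posts']:
    title = post['title']
    # The description arrives as an HTML fragment; naming the parser avoids a bs4 warning.
    soup = bs(post['content'], 'html.parser')
    # First paragraph only. For the full text: soup.get_text().replace(u'\xa0', u' ').replace(u'\n', u' ')
    desc = soup.select_one('p').text
    img = post['image']['full']
    episode_link = post['audioSource']  # sure this is what you wanted?
    episode_number = post['episodeNumber']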
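
A note on the UnicodeDecodeError in the Kodi log added to the question: it comes from the get_text() alternative in the comment above when it is run under Python 2 with plain '\xa0'. There, get_text() returns unicode while '\xa0' is a byte string, so replace() tries to decode that byte with the ASCII codec and fails. A minimal sketch of a fix that works on both Python 2 and 3, with clean_text as a hypothetical helper name:

```python
# -*- coding: utf-8 -*-


def clean_text(text):
    """Replace non-breaking spaces and newlines with plain spaces."""
    # The u'' prefix keeps every argument unicode, so Python 2 never has to
    # decode a raw 0xa0 byte with the ASCII codec.
    return text.replace(u'\xa0', u' ').replace(u'\n', u' ').strip()


raw = u'\xa0When Cecilia Gentili was growing up in Argentina...\n'
print(clean_text(raw))
```

In the loop above this would be used as desc = clean_text(soup.get_text()).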

Many thanks once again @QHarr & modelbuilder42 - but I've still no idea what I'm doing. I've pasted that code in and the code it's telling me there's an unexpected indent. Even if I did successfully grab the results I wanted, I don't know how to produce them into the Kodi addon. I think I might just have to pay someone via one of those freelance coder session websites or something. Thanks though! — leopheard, Jun 07 '19 at 20:21
I'm guessing it is how you have integrated code as above is correctly indented. — QHarr, Jun 07 '19 at 20:22
nowhere near unfortunately. I have the HTML saved in a more readable .json format apparently, but still no idea where to paste it. The code you suggested comes out with these Kodi log errors (added to end of original post) — leopheard, Jun 09 '19 at 04:10
Also, even if I do have a .json file, does that mean it will scrape those fields every time the app is opened dynamically i.e. produce up to date results? — leopheard, Jun 09 '19 at 04:18
