The website I'm trying to scrape media links from (for a Kodi addon) has very few class or similar markers to hook into, although each link sits in a fairly distinctive layout.
I created the basic Kodi addon from another working one, but I'm having trouble getting Python/BeautifulSoup to scrape the links. Other addons select elements by class and similar attributes, but the website I'm scraping barely uses them.
I've tried all sorts of forums with no luck; most Kodi addon forums are old and not very active. The guides I've looked at seem to go from step 1 to step 1000 very quickly, and the examples they give aren't relevant. I've looked at 30 or so different addons thinking that would help, but I can't work it out.
The media links, episode titles, descriptions and images I'm trying to scrape are listed on www.thisiscriminal.com/episodes
The full addon I've done so far is at Github-repository
I can see in the page source that they're clearly set out (see the website code below).
I basically just need to parse the website, find the bits below for each episode, populate them as links on the Kodi addon page, and then list the next one underneath. Any help would be greatly appreciated. I've spent about 3 straight days trying to do this and am both very glad and annoyed that I dropped out of that IT degree I started in 2002.
WEBSITE CODE I NEED TO PULL
(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>
(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>
(episode number)
<h4>Episode #115</h4>
(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"
(episode description)
</header>When Cecilia....</article>
CODE
import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []
    for content in soup.find_all('a'):
        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link
            title = content.find('<h4>Episode ')
            title = title.get_text()
            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()
            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        # need to check that item is not null here
        subjects.append(item)
    return subjects
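For reference, a minimal sketch of selecting by tag structure rather than by class might look like the following. It runs against a hand-written sample instead of the live site, and it rests on several assumptions: that each episode is wrapped in its own `<article>` with a `<header>` (suggested by the `</header>...</article>` snippet above), that the image URL sits in a `src` attribute, and that the audio link can be recognised by `.mp3` in its `href`. The `number` key and the "Listen" link text are made up for illustration. Note that `find()` takes a tag name plus filters, not a fragment of raw HTML.

```python
import re
from bs4 import BeautifulSoup

# Hand-written approximation of one episode block, built from the snippets
# above. The <article>/<header> nesting, the src= attribute and the "Listen"
# link text are assumptions; the real page may differ.
SAMPLE_HTML = """
<article>
  <header>
    <img width="300" height="300" src="https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png"/>
    <h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>
    <h4>Episode #115</h4>
    <p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3">Listen</a></p>
  </header>
  When Cecilia....
</article>
"""


def get_playable_podcast(soup):
    """Return one dict per <article> that contains an .mp3 link."""
    subjects = []
    for article in soup.find_all('article'):
        # Look up each piece by tag name or href pattern, not by class.
        audio = article.find('a', href=re.compile(r'\.mp3'))
        title = article.find('h3')
        number = article.find('h4')
        image = article.find('img')
        if audio is None or title is None:
            continue  # not an episode block, skip it
        # Remove the header so that only the description text is left.
        # Note: this modifies the parsed tree in place.
        header = article.find('header')
        if header is not None:
            header.extract()
        subjects.append({
            'url': audio.get('href'),
            'title': title.get_text(strip=True),
            'number': number.get_text(strip=True) if number is not None else '',
            'desc': article.get_text(' ', strip=True),
            'thumbnail': image.get('src') if image is not None else '',
        })
    return subjects


episodes = get_playable_podcast(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for episode in episodes:
    print(episode['title'] + ' -> ' + episode['url'])
```

Checking each lookup for `None` instead of catching `AttributeError` makes it easier to see which piece is missing when the real markup differs from the sample.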
ERROR LOG
2019-06-09 00:05:35.719 T:1916360240 ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576 ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44, in
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280 ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240 ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed
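The `UnicodeDecodeError` in the log looks like the usual Python 2 mix of text types, assuming the addon runs under Python 2 (the `print` statements suggest it does). `get_text()` returns a unicode string, while `'\xa0'` is a byte string, so Python 2 tries to decode that byte as ASCII before replacing and fails. A small sketch of a fix is to use unicode literals on both sides; `clean_text` is a hypothetical helper name, not something from the addon.

```python
# -*- coding: utf-8 -*-


def clean_text(text):
    """Replace non-breaking spaces and newlines with plain spaces.

    clean_text is a hypothetical helper name, not part of the addon.
    """
    # The u'' prefix keeps both arguments of replace() as unicode on
    # Python 2, so no implicit ASCII decode of a byte string is attempted.
    return text.replace(u'\xa0', u' ').replace(u'\n', u' ')


sample = u'When\xa0Cecilia\nwas young'
print(clean_text(sample))
```

The same code also runs unchanged on Python 3, where the `u` prefix is accepted and has no effect.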