BeautifulSoup4 performance on Raspberry Pi 3

Question

I am making a Kodi addon that I will run on my Raspberry Pi 3. The addon scrapes information from a website to fill a list of items. Everything works, but when I deploy it on the Raspberry Pi 3, performance becomes an issue: it takes 15 seconds before the web page is parsed. This is the slow line:

    soup = BeautifulSoup(response, "html.parser", parse_only=tiles)

I already use SoupStrainer to improve performance, but it did not have the impact I was hoping for.

    # urljoin and urlopen come from urlparse/urllib2 on Python 2
    # (urllib.parse/urllib.request on Python 3).
    from bs4 import BeautifulSoup, SoupStrainer
    import xbmcgui
    import xbmcplugin

    _VRT_BASE = "https://www.vrt.be/"

    def __list_videos_az(self):
        joined_url = urljoin(self._VRTNU_BASE_URL, "./a-z/")
        response = urlopen(joined_url)
        # Only build tree nodes for <a class="tile"> elements.
        tiles = SoupStrainer('a', {"class": "tile"})
        soup = BeautifulSoup(response, "html.parser", parse_only=tiles)
        listing = []
        for tile in soup.find_all(class_="tile"):
            link_to_video = tile["href"]
            li = self.__get_item(tile, "false")
            url = '{0}?action=getepisodes&video={1}'.format(_url, link_to_video)
            listing.append((url, li, True))

        xbmcplugin.addDirectoryItems(_handle, listing, len(listing))
        xbmcplugin.addSortMethod(_handle, xbmcplugin.SORT_METHOD_LABEL_IGNORE_THE)
        xbmcplugin.endOfDirectory(_handle)

    def __get_item(self, element, is_playable):
        thumbnail = self.__format_image_url(element)
        found_element = element.find(class_="tile__title")
        li = None
        # li stays None when the tile has no title element.
        if found_element is not None:
            li = xbmcgui.ListItem(found_element.contents[0]
                                  .replace("\n", "").strip())
            li.setProperty('IsPlayable', is_playable)
            li.setArt({'thumb': thumbnail})
        return li

Could someone tell me how to improve the performance of the program? I was thinking a regex might be faster, but a lot of people say you should not parse HTML that way, and putting together the regex is also challenging.

So is there anything I can do to improve performance?

score 1 · Accepted Answer · answered Mar 10 '17 at 16:47

I'd recommend trying the lxml parser, which is written in C (Cython, actually) and is generally faster. To obtain the package, install it on Raspbian (apt-get install python-lxml or pip install lxml) and then copy it into your addon. The lxml package contains compiled binary modules, so it's important to obtain a build for your specific platform.
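As a rough illustration of this suggestion, here is a minimal sketch that uses lxml as the tree builder for BeautifulSoup 4 when it can be imported and falls back to the built-in "html.parser" otherwise. The helper names (pick_parser, extract_tiles) and the sample HTML are made up for the example and are not part of the addon; only the "tile" and "tile__title" class names come from the question.

```python
from bs4 import BeautifulSoup, SoupStrainer


def pick_parser():
    """Return "lxml" when the package can be imported, else "html.parser"."""
    try:
        import lxml  # noqa: F401  (only checking that it is available)
        return "lxml"
    except ImportError:
        return "html.parser"


def extract_tiles(html, parser=None):
    """Return (href, title) pairs for every <a class="tile"> in the page.

    Hypothetical helper: it mirrors the loop in the question without
    depending on any Kodi modules.
    """
    # Restrict tree building to the tile anchors, as in the question.
    only_tiles = SoupStrainer("a", class_="tile")
    soup = BeautifulSoup(html, parser or pick_parser(), parse_only=only_tiles)

    results = []
    for tile in soup.find_all("a", class_="tile"):
        title = tile.find(class_="tile__title")
        if title is None:
            continue  # skip tiles without a title element
        results.append((tile["href"], title.get_text(strip=True)))
    return results


# Made-up sample markup, used only to exercise the helper.
SAMPLE_HTML = """
<html><body><div>
  <a class="tile" href="/a/"><h3 class="tile__title">
    Alpha
  </h3></a>
  <a class="other" href="/x/">skip me</a>
  <a class="tile" href="/b/"><h3 class="tile__title">Beta</h3></a>
</div></body></html>
"""

tiles = extract_tiles(SAMPLE_HTML)
```

Whether this helps on the Pi depends on where the time actually goes, so it is worth measuring both parsers on the real page before bundling a platform-specific lxml build.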

answered Mar 10 '17 at 16:47

Roman Miroshnychenko

Thanks for answering. I already tried looking at the lxml parser but found it difficult to get it to work. But I think in the end I mixed up the timings: the scraping takes 3 seconds and the for loop 15. I had put together a regex, which I think should be the fastest? And in the end it was still slow because I am adding a lot of list items. I think it's related to bug #17304 http://trac.kodi.tv/ticket/17304 . Now I am waiting for a LibreELEC update before I program/test more. (I also separated more code from addon.py, as the other guy said.) – Martijn Mar 10 '17 at 19:16
I didn't mean working with `lxml` directly, I meant using it as a tree builder/parser for BeautifulSoup 4. And I don't recommend regexes for complex HTML parsing. As for the bug, it's for the Kodi dev team to fix it. – Roman Miroshnychenko Mar 10 '17 at 20:20
I'll keep it in mind! And to deploy it on my Raspberry Pi I need a new version of LibreELEC with the Kodi 17.1 build – Martijn Mar 10 '17 at 22:51
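
The comments above turn on which stage is actually slow (parsing versus the loop that builds list items). A minimal sketch of timing each stage separately follows; the `timed` helper, the stage labels, and the stand-in workloads are all hypothetical, and `time.time()` is used because it is available on both Python 2 and Python 3.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Add the wall-clock seconds spent inside the with-block to results[label]."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = results.get(label, 0.0) + (time.time() - start)


def report(results):
    """Format the collected timings, slowest stage first."""
    ordered = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return ["{0}: {1:.3f}s".format(label, seconds) for label, seconds in ordered]


# Stand-in workloads; in the addon these would wrap the BeautifulSoup call
# and the for loop that creates the list items.
stage_times = {}
with timed("parse", stage_times):
    parsed = [str(n) for n in range(10000)]
with timed("build_items", stage_times):
    items = [(text, len(text)) for text in parsed]

lines = report(stage_times)
```

In the addon, the formatted lines could be written to the Kodi log so the numbers come from the Raspberry Pi itself rather than from the development machine.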

score 0 · Answer 2 · answered Mar 06 '17 at 20:49

I suspect you only have one file? The first step would probably be moving that code out of your main file. The general guideline is to keep the file that your addon.xml references as small as possible, as it's the only file whose compiled bytecode is not cached.
