I have to parse HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4
(BeautifulSoup) module, since most people suggest it. The following code snippet extracts the text:
from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
# Remove script and style elements before extracting the text
for tag in soup.find_all(['script', 'style']):
    tag.extract()
txt = soup.get_text().encode('utf-8')
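For reference, here is a minimal self-contained version of the snippet above, wrapped in a function and run on a tiny sample page. It uses Python's built-in "html.parser" backend so it runs even without lxml installed; the function name and sample HTML are just for illustration:

from bs4 import BeautifulSoup

def extract_text(src):
    """Strip <script>/<style> tags and return the visible text."""
    soup = BeautifulSoup(src, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.extract()  # removes the tag and its contents from the tree
    return soup.get_text()

html = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>"
print(extract_text(html).strip())  # → Hello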
Without bs4, one file is completely processed in 9 minutes (test case), but if I use bs4
to parse the text, the job takes about 4 hours. Why is this happening? Is there a better solution than bs4?
Note: bs4 is the package that provides classes such as BeautifulSoup.