I have to parse HTML content in the Common Crawl dataset (warc.gz files). I decided to use the bs4
(BeautifulSoup) module, since most people suggest it. The following code snippet extracts the text:
from bs4 import BeautifulSoup

soup = BeautifulSoup(src, "lxml")
# Remove script and style elements before extracting the text
for tag in soup.find_all(['script', 'style']):
    tag.extract()
txt = soup.get_text().encode('utf-8')
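For reference, here is a minimal self-contained version of the snippet above, wrapped in a function and run on a tiny sample page. It uses Python's built-in "html.parser" backend so it runs even without lxml installed; the function name and sample HTML are just for illustration:

from bs4 import BeautifulSoup

def extract_text(src):
    """Strip <script>/<style> tags and return the visible text."""
    soup = BeautifulSoup(src, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.extract()  # removes the tag and its contents from the tree
    return soup.get_text()

html = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>"
print(extract_text(html).strip())  # → Hello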
Without bs4, one file is completely processed in 9 minutes (test case), but if I use bs4
to parse the text, the job takes about 4 hours. Why is this happening? Is there a better solution than bs4?
Note: bs4 is the package that provides classes such as BeautifulSoup.