Python cache html file

Question

I am saving some HTML files locally and I want to strip them of all unnecessary information. Essentially, I want to remove all <script> and <style> tags and their contents.

I use the Selenium webdriver and can access the page source with something like this:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://somesite.com')
html = driver.page_source

I had three different ideas:

+ Use jQuery to remove the unnecessary tags and then access the page_source attribute to cache it locally. Something along these lines:
```
driver.execute_script("""$('style, script').remove()""")
cache(driver.page_source)
```

But this won't work: I can't cripple the page internally, because I need the site to stay intact for further interactions with the Selenium driver instance.

+ Use lxml to parse driver.page_source and then remove all unwanted information. After that, access the modified page source and cache it locally. In code:

    parsed = lxml.html.fromstring(driver.page_source)

    for bad, worse in zip(parsed.xpath('//script'), parsed.xpath('//style')):
      bad.getparent().remove(bad)
      worse.getparent().remove(worse)
    cache(parsed.text)
    # Problem: parsed.text is empty :/ How can I access the modified source? Remember, I don't need no text_content()

+ Modify and truncate the source directly in the webdriver and then access the page_source attribute. But there aren't any methods to alter the DOM in webdriver instances.

I guess the lxml approach is the best one because, however I try to wrap my head around the problem, I shouldn't mess with the webdriver instance, since I need to keep interacting with it. Did I miss something with the lxml approach?

Cheers

alecxe · Accepted Answer · 2014-05-02T22:47:58.720

2

You can find both script and style tags with a single XPath expression. After removing the tags, get the modified HTML using lxml.html.tostring():

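import lxml.html

parsed = lxml.html.fromstring(html)

# Match <script> and <style> elements with a single XPath union
for bad in parsed.xpath('//script|//style'):
    bad.getparent().remove(bad)

# tostring() returns bytes by default; encoding='unicode' yields a str
print(lxml.html.tostring(parsed, encoding='unicode'))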
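
One thing to watch for with getparent().remove(): in lxml it also discards the element's tail, that is, any text that follows the removed tag inside the same parent. If a <script> sits in the middle of a paragraph, that trailing text goes with it. Below is a minimal sketch of a variant that uses drop_tree() from lxml.html to keep the tail text. The function name strip_scripts_and_styles and the sample markup are made up for illustration; with Selenium you would pass in driver.page_source.

```python
import lxml.html


def strip_scripts_and_styles(html):
    """Return html as a str with all <script> and <style> elements removed."""
    parsed = lxml.html.fromstring(html)
    for element in parsed.xpath('//script|//style'):
        # drop_tree() removes the element and its children but keeps the
        # text that follows it (its tail), unlike getparent().remove().
        element.drop_tree()
    # encoding='unicode' makes tostring() return str instead of bytes.
    return lxml.html.tostring(parsed, encoding='unicode')


# Hypothetical sample page, standing in for driver.page_source.
sample = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><p>before<script>alert(1)</script>after</p></body></html>'
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

Since this works on a copy of the source string, the live page in the webdriver stays untouched, so you can keep interacting with it afterwards.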

edited May 02 '14 at 22:47

answered May 02 '14 at 22:39
