I am saving some HTML files locally and I want to strip them of all unnecessary information. Essentially, I want to remove all <script> and <style> tags together with their contents.

I use Selenium WebDriver and can access the page source with something like this:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://somesite.com')
html = driver.page_source

I had three different ideas:

  • Use jQuery to remove the unnecessary tags and then access the page_source attribute to cache it locally. Something along these lines:

    driver.execute_script("""$('style, script').remove()""")
    cache(driver.page_source)
    

But this won't work for me: I can't cripple the live page, because I need the site to stay intact for further interactions with the selenium driver instance.

  • Use lxml to parse driver.page_source and remove all unwanted information, then access the modified page source and cache it locally. In code:

    parsed = lxml.html.fromstring(driver.page_source)

    for bad, worse in zip(parsed.xpath('//script'), parsed.xpath('//style')):
      bad.getparent().remove(bad)
      worse.getparent().remove(worse)
    cache(parsed.text)
    # Problem: parsed.text is empty :/ How can I access the modified source? Remember, I don't need no text_content()

  • Modify and truncate the source directly in webdriver and then access the page_source attribute. But there aren't any methods to alter the DOM in webdriver instances.

I guess the lxml approach is the best one: however I look at the problem, I shouldn't modify the page in the webdriver instance, since I need to interact with it further. Did I miss something with the lxml approach?

Cheers

tchrist
Nikolai Tschacher

1 Answer

You can find both script and style tags in a single xpath expression. After removing the tags, get the modified html using lxml.html.tostring():

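import lxml.html

parsed = lxml.html.fromstring(html)

# Remove every <script> and <style> element from the parsed tree
for bad in parsed.xpath('//script|//style'):
    bad.getparent().remove(bad)

# Serialize the modified tree back to an HTML string
print(lxml.html.tostring(parsed, encoding='unicode'))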
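
One caveat worth checking: in lxml, text that directly follows an element is stored on that element as its "tail", so getparent().remove() may drop that text along with the tag. Below is a minimal sketch of the same idea using drop_tree(), which lxml.html elements provide and which keeps the tail text. The function name strip_scripts_and_styles and the sample markup are only illustrative, not part of the original answer:

```python
import lxml.html


def strip_scripts_and_styles(html):
    """Return html as a string with all <script> and <style> elements removed."""
    parsed = lxml.html.fromstring(html)
    # xpath() returns a list, so removing elements while looping is safe.
    for element in parsed.xpath('//script|//style'):
        # drop_tree() removes the element and its children, but joins the
        # text that follows the element onto the previous sibling or parent.
        element.drop_tree()
    # encoding='unicode' makes tostring() return text instead of bytes.
    return lxml.html.tostring(parsed, encoding='unicode')


# Hypothetical sample page, used only to demonstrate the function.
sample = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><p>Hello<script>alert(1)</script> world</p></body></html>'
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

Because this only works on the string taken from driver.page_source, the live page in the webdriver instance stays untouched and can still be used for further interactions.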
alecxe