I am saving some HTML files locally and I want to strip them of all unnecessary information. Essentially, I want to remove all <script> and <style> tags together with their contents.
I use the Selenium webdriver and can access the page source like this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://somesite.com')
html = driver.page_source
I had three different ideas:
+ Use jQuery to remove the unnecessary tags and then access the page_source attribute to cache it locally. Something along these lines:
driver.execute_script("""$('style, script').remove()""")
cache(driver.page_source)
But this won't work for me: it cripples the live page, and I need the site to stay intact for further interactions with the selenium driver instance.
+ Use lxml to parse driver.page_source and then remove all unwanted elements. After that, access the modified page source and cache it locally. In code:
import lxml.html

parsed = lxml.html.fromstring(driver.page_source)
# Select <script> and <style> elements in a single XPath union
for bad in parsed.xpath('//script | //style'):
    bad.getparent().remove(bad)
cache(parsed.text)
# Problem: parsed.text is empty :/ How can I access the modified source? Remember, I don't want text_content(); I need the markup itself.
+ Modify and truncate the source directly in the webdriver and then access the page_source attribute. But there don't seem to be any methods to alter the DOM in webdriver instances.
I guess the lxml approach is the best one: however I look at the problem, I shouldn't mess with the webdriver instance, since I need to keep interacting with it. Did I miss something in the lxml approach?
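For what it's worth, here is a minimal sketch of how I imagine the lxml variant could work. It assumes lxml.html.tostring() is the way to get markup back out of the parsed tree; as far as I understand, the .text attribute only holds the text directly inside the root element, not the serialised document. The names strip_script_and_style and sample are made up for illustration. Since it only works on a copy of the page source string, the driver itself stays untouched.

```python
import lxml.html


def strip_script_and_style(page_source):
    """Return page_source re-serialised without <script>/<style> elements."""
    root = lxml.html.fromstring(page_source)
    # xpath() returns a plain list, so removing elements while looping is safe.
    for element in root.xpath('//script | //style'):
        # drop_tree() removes the element and its children but keeps any
        # tail text that follows the closing tag.
        element.drop_tree()
    # Serialise the modified tree back to an HTML string.
    return lxml.html.tostring(root, encoding='unicode')


# Hypothetical stand-in for driver.page_source
sample = (
    '<html><head><title>t</title><style>p {color: red}</style></head>'
    '<body><p>keep me</p><script>alert(1)</script></body></html>'
)
cleaned = strip_script_and_style(sample)
print(cleaned)
```

With selenium this would presumably be called as cache(strip_script_and_style(driver.page_source)), where cache is my own helper from above.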
Cheers