I'm learning Python and the lxml toolkit. I need to process multiple .htm files in a local directory (recursively) and remove unwanted tags, including their content: divs with the IDs "box", "columnRight", "adbox" and "footer", divs with class="box", plus all stylesheets and scripts. I can't figure out how to do this. I have code that lists all .htm files in the directory:
#!/usr/bin/python
import os
import lxml.html as lh

path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            # os.walk yields bare file names, so join each with its directory
            filename = os.path.join(root, name)
            doc = lh.parse(filename)
So I need to add a part that creates a tree, processes the HTML and removes the unnecessary divs, something like:
for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element)
How should I adjust the code for this?
html page example.
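
For reference, here is a minimal sketch of how the two pieces might fit together. It rests on assumptions not stated above: that each file may be overwritten in place, and that writing the result back as UTF-8 is acceptable. The names `UNWANTED`, `strip_unwanted` and `process_directory` are hypothetical helpers made up for this illustration. It uses `drop_tree()` from lxml.html instead of `getparent().remove()`, because `remove()` also discards any text that directly follows the removed element.

```python
import os
import lxml.html as lh

# XPath expressions for everything to drop. The IDs and the class name come
# from the question; treating <link rel="stylesheet"> and <style> as
# "stylesheets" is an assumption.
UNWANTED = [
    '//div[@id="box" or @id="columnRight" or @id="adbox" or @id="footer"]',
    # Match "box" as one whole class token, so class="box big" matches
    # but class="sandbox" does not.
    '//div[contains(concat(" ", normalize-space(@class), " "), " box ")]',
    '//script',
    '//style',
    '//link[@rel="stylesheet"]',
]


def strip_unwanted(root):
    """Remove every element matched by UNWANTED, together with its content."""
    for expr in UNWANTED:
        for element in root.xpath(expr):
            # drop_tree() removes the element and its children but keeps
            # the tail text that follows the element.
            element.drop_tree()
    return root


def process_directory(path):
    """Walk path recursively and rewrite each .htm file in place (assumption)."""
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if not name.endswith(".htm"):
                continue
            filename = os.path.join(dirpath, name)
            doc = lh.parse(filename)
            root = doc.getroot()
            if root is None:
                # Nothing could be parsed from this file; leave it untouched.
                continue
            strip_unwanted(root)
            # Writing back as UTF-8 is an assumption; adjust if needed.
            doc.write(filename, method="html", encoding="utf-8")
```

Calling `process_directory('/path/to/directory')` would then handle every .htm file under that path. Since the files are overwritten, it is probably safer to try this on a copy of the directory first.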