I'm learning Python and the lxml toolkit. I need to process multiple .htm files in a local directory (recursively) and remove unwanted tags, including their content (divs with IDs "box", "columnRight", "adbox", "footer", div class="box", plus all stylesheets and scripts). I can't figure out how to do this. I have code that lists all .htm files in a directory:

#!/usr/bin/python
import os
from lxml import html
import lxml.html as lh

path = '/path/to/directory'
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            # os.walk yields bare file names, so join them with the current directory
            doc = lh.parse(os.path.join(root, name))

So I need to add a part that creates a tree, processes the HTML, and removes the unnecessary divs, like

for element in tree.xpath('//div[@id="header"]'):
    element.getparent().remove(element) 

How do I adjust the code for this?

HTML page example.

Lexx Luxx
  • It is unclear what the problem is. Are you unable to parse an HTML file? See https://lxml.de/tutorial.html#parsing-from-strings-and-files – mzjn Aug 23 '21 at 12:59
  • @mzjn I'm not sure about the right syntax in my particular case, as the examples are too abstract. – Lexx Luxx Aug 23 '21 at 14:14
  • Syntax for what? The first snippet is not about cleaning up HTML, it is about walking a directory to find files. The second is an attempt to remove elements from a list returned by `xpath()`. What exactly are you struggling with? – mzjn Aug 23 '21 at 14:22
  • I need to recursively walk through a directory, find all .htm files, then retrieve and parse each page to remove the target elements. The 1st snippet wasn't right for the purpose, so I edited it. I'm not sure how to proceed: how to build the tree of elements with `tree = html.parse(path)` and how to join it with the 2nd snippet. – Lexx Luxx Aug 23 '21 at 20:05

1 Answer

It's hard to tell without seeing your actual files, but try the following and see if it works:

First, you don't need both

from lxml import html
import lxml.html as lh

So you can drop the first. Then:

for dirpath, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".htm"):
            # join with dirpath: os.walk yields bare file names
            tree = lh.parse(os.path.join(dirpath, name))
            root = tree.getroot()
            for element in root.xpath('//div[@id="header"]'):
                element.getparent().remove(element)
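
Putting it together for the elements listed in the question, here is a minimal sketch that has not been run against your actual pages. The names `UNWANTED`, `clean_tree`, `clean_file` and `clean_directory` are made up for illustration, and it assumes you want each file overwritten in place as UTF-8, so try it on a copy of the directory first. It uses lxml.html's `drop_tree()` instead of `getparent().remove()`, because `remove()` also discards any text that directly follows the removed element (its tail).

```python
import os
import lxml.html as lh

# Everything the question wants removed (assumed selectors; adjust to your pages).
# @class="box" only matches when the attribute is exactly "box".
UNWANTED = (
    '//div[@id="box" or @id="columnRight" or @id="adbox" or @id="footer"]'
    ' | //div[@class="box"]'
    ' | //script | //style | //link[@rel="stylesheet"]'
)


def clean_tree(doc_root):
    """Drop unwanted elements and their content in place; return how many were dropped."""
    removed = 0
    for element in doc_root.xpath(UNWANTED):
        # drop_tree() needs a parent, so skip a matching root element
        if element.getparent() is not None:
            element.drop_tree()  # keeps the text that follows the element
            removed += 1
    return removed


def clean_file(path):
    """Parse one file, clean it, and write it back to the same path."""
    tree = lh.parse(path)
    doc_root = tree.getroot()
    if doc_root is None:  # nothing parseable in the file
        return 0
    removed = clean_tree(doc_root)
    tree.write(path, method="html", encoding="utf-8")
    return removed


def clean_directory(top):
    """Clean every .htm file under top (recursively); return the total dropped."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".htm"):
                # os.walk yields bare names, so build the full path
                total += clean_file(os.path.join(dirpath, name))
    return total
```

You would call it as `clean_directory('/path/to/directory')`. If your divs carry several classes (for example `class="box wide"`), the exact-match class test above will miss them and the XPath needs adjusting.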
Jack Fleeting
  • I tested it and got an error: `Traceback (most recent call last): File "./clean.py", line 9, in tree = etree.parse(name) NameError: name 'etree' is not defined` Also, should I omit the 1st import, `from lxml import html`, or the 2nd? – Lexx Luxx Aug 24 '21 at 12:00
  • For the first question - my mistake; it was a typo - see edited version (changed `etree` to `lh`). As to the 2nd question - only need `import lxml.html as lh`. – Jack Fleeting Aug 24 '21 at 12:06
  • When I tried it, I got an [IOError](https://pastebin.com/raw/A3aEDgX4). The actual page example is at the top. – Lexx Luxx Aug 24 '21 at 12:54
  • @triwo I'm afraid I can't help with that. That has something to do with your files and directory structure which only you have access to. – Jack Fleeting Aug 24 '21 at 12:56
  • OK. Possibly this error is an lxml etree alert that happens when the file or directory is not found. How can I make the script insensitive to this? – Lexx Luxx Aug 24 '21 at 15:48
  • @triwo You probably should ask it as a separate question. It has nothing to do with lxml, xpath and html parsing, which are the tags in your question. – Jack Fleeting Aug 24 '21 at 15:53
  • @triwo An aside: from the IOError, I can see that you use Python 2.7. Python 2 is old and unmaintained. You should start using Python 3. – mzjn Aug 25 '21 at 10:01
  • @mzjn Move to Python 3 **for what**? This is wrong. A stable script or program should work with any Python version for the convenience of users. And every person has different goals and needs. – Lexx Luxx Aug 25 '21 at 10:40
  • This is off topic, but there is no good reason to use Python 2 these days (unless you have to maintain a legacy Python 2 codebase). – mzjn Aug 25 '21 at 10:47
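
On the IOError raised in the comments: the traceback is not reproduced here, so this is only a guess, but one likely cause is that `os.walk` yields bare file names, so `lh.parse(name)` looks for the file relative to the current working directory instead of the directory being walked. The sketch below joins the directory and file name, and skips files that cannot be read or parsed instead of stopping. `parse_or_skip` and `iter_htm_roots` are hypothetical helper names, and catching `etree.LxmlError` alongside `IOError` is an assumption about which failures you want to tolerate.

```python
import os
import lxml.html as lh
from lxml import etree


def parse_or_skip(path):
    """Return the parsed root element, or None if the file cannot be read or parsed."""
    try:
        return lh.parse(path).getroot()
    except (IOError, etree.LxmlError) as exc:
        # report the problem and carry on with the next file
        print("skipping %s: %s" % (path, exc))
        return None


def iter_htm_roots(top):
    """Yield (full_path, root) for every .htm file under top that parses."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".htm"):
                full_path = os.path.join(dirpath, name)
                root = parse_or_skip(full_path)
                if root is not None:
                    yield full_path, root
```

A loop such as `for full_path, root in iter_htm_roots(path):` then only ever sees pages that were actually loaded.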