-1

I have twelve long, complex HTML files that I did not author. Some of them display properly; others do not. I've been using Python and BeautifulSoup to explore them. Searching here and elsewhere, I can't find any examples of how to strip out and discard the content while keeping the document tree.

EDIT: Another useful operation would not delete the text at all, but instead generate a list of the tags used in a file (a whitelist) that I can test against the other files that are not displaying correctly. Again, the methods in lxml's Cleaner seem designed only to delete tags, not to preserve the structure.

EDIT: As you can see from this short list of questions (only those scoring above 0) on the converse operation, getting content out of HTML with Python, everyone recommends BeautifulSoup, lxml, and similar modules. And they're all newbie questions. I've read the Beautiful Soup documentation, and tree extraction is not one of its methods. I was hoping someone with more experience using BS could tell me whether one of its methods could be used for what I need before I sink time into these modules, or whether another module might be easier for my goal.

+1 A: Making a basic web scraper in Python with only built in libraries - Python newbie

+4 A: What is a light python library that can eliminate HTML tags?...

+1 How to get only text of a webpage with Python, just as Select-all & Copy in browser?

+2 Q: Extracting Text from Parsed HTML with Python

+2 A: Easy way to get data between tags of xml or html files in python?

xtian
  • Will you post your code? What exactly is the specific problem? – That1Guy Jan 23 '14 at 01:18
  • @That1Guy: Comparing HTML file structures. I'm actually working to convert the free nltk.org book into a Kindle document using the pdfcrowd HTML-to-PDF converter. Several pages have readable line lengths of 45-50 characters, while others run to 60+. – xtian Jan 28 '14 at 01:04

1 Answer

0

Maybe you can try pyquery: https://pypi.python.org/pypi/pyquery

It lets you operate on the DOM tree much as you would with jQuery.
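
For the two operations the question asks about (discard the text but keep the tree, and collect the set of tags in a file to compare against other files), here is a minimal sketch. Note that it uses Python's standard-library `html.parser` rather than pyquery, so it runs without extra installs. The names `SkeletonParser`, `skeleton`, and `VOID_TAGS` are made up for illustration, and the sample HTML strings are invented. It also assumes every non-void tag is explicitly closed; real-world files with unclosed `<p>` or `<li>` tags would make the indentation drift, in which case a repairing parser such as lxml or BeautifulSoup may be the better base.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SkeletonParser(HTMLParser):
    """Hypothetical helper: record the tag outline and ignore all text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []   # indented tag names, in document order
        self.tags = set()   # every distinct tag name seen

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        self.outline.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    # handle_data is deliberately not overridden, so text content is dropped.


def skeleton(html):
    """Return (outline, tags) for an HTML string."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return parser.outline, parser.tags


# Invented sample documents: one "good" file and one that displays badly.
good_outline, good_tags = skeleton(
    "<html><body><h1>Title</h1><p>Some <b>bold</b> text<br>more</p></body></html>"
)
bad_outline, bad_tags = skeleton(
    "<html><body><table><tr><td>cell</td></tr></table></body></html>"
)

print("\n".join(good_outline))       # the document tree without any content
print(sorted(bad_tags - good_tags))  # tags not on the whitelist from the good file
```

Treating `good_tags` as the whitelist, the set difference shows which tags appear only in the problem file, which is one place to start looking for the cause of the display differences.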

TwilightSun