I have twelve long and complex HTML files to work with which I did not author. Some of these files display properly; others do not. I've been experimenting with Python and BeautifulSoup to explore them. Searching here and elsewhere, I can't find any examples of how to strip out and discard the content while keeping the document tree.
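To make the goal concrete, this is roughly the operation I have in mind, sketched with the bs4 package and a placeholder file name; I don't know whether this is a sensible way to use BeautifulSoup, which is part of what I'm asking:

    from bs4 import BeautifulSoup, NavigableString

    with open("page.html") as f:  # placeholder for one of my files
        soup = BeautifulSoup(f, "html.parser")

    # For every tag, pull out the text nodes it directly contains
    # (comments included), so only the nesting of tags remains.
    for tag in soup.find_all():
        for child in list(tag.contents):
            if isinstance(child, NavigableString):
                child.extract()

    print(soup.prettify())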
EDIT: Another useful operation would not delete the text at all, but instead generate a list of the tags used in a file (the whitelist), which I could then test the files that are not displaying correctly against. Again, looking through the methods of lxml's Cleaner, it seems to want only to delete tags, not to keep the structure.
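Something like the following is what I mean by the whitelist check, again only a sketch with bs4 and made-up file names:

    from bs4 import BeautifulSoup

    def tag_names(path):
        """Return the set of distinct tag names used in one HTML file."""
        with open(path) as f:
            soup = BeautifulSoup(f, "html.parser")
        return {tag.name for tag in soup.find_all()}

    # Placeholder names for a file that displays properly and one that doesn't.
    whitelist = tag_names("displays_properly.html")
    suspect = tag_names("displays_badly.html")
    print("tags not in the whitelist:", sorted(suspect - whitelist))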
EDIT: As you can see from the short list below (only the questions scored > 0) on the converse operation, getting content out of HTML with Python, every one recommends BeautifulSoup, lxml, or similar modules, and they are all newbie questions. I've read the Beautiful Soup documentation, and extracting the tree while discarding its content is not one of its documented operations. I was hoping someone with more BeautifulSoup experience could tell me whether one of its methods can be used the way I need before I spend more time on these modules, or whether another module would be a better fit for my goal.
+1 A: Making a basic web scraper in Python with only built in libraries - Python newbie
+4 A: What is a light python library that can eliminate HTML tags?...
+1 How to get only text of a webpage with Python, just as Select-all & Copy in browser?
+2 Q: Extracting Text from Parsed HTML with Python
+2 A: Easy way to get data between tags of xml or html files in python?