I'm developing a simple tool that extracts relevant data from HTML files and writes it to TXT files. So far I've achieved most of what I had in mind, but the result is still unusable because many lines consisting only of whitespace keep getting written to the final TXT files. I'll attach a picture of what one of the TXT files currently looks like:
Ideally, I'd want all lines containing text to be consecutive. How do I skip all the lines containing ONLY whitespace (i.e. containing no alphanumeric characters) when reading the HTML file, once I've removed the tags? (The whitespace is what remains after deleting everything between "<" and ">" for the TXT files.)
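
To make the goal concrete, here is a minimal sketch of the behaviour I'm after. It assumes Python, since I haven't said which language the tool is in, and the names `TAG_RE` and `extract_text_lines` are made up for illustration. It removes everything between "<" and ">" and then keeps only the lines that still contain something other than whitespace:

```python
import re
from typing import List

# Hypothetical pattern: matches "<", anything up to the next ">", and that ">".
# The negated class [^>] also matches newlines, so tags that span several
# lines are removed as well.
TAG_RE = re.compile(r"<[^>]*>")


def extract_text_lines(html: str) -> List[str]:
    """Strip tags, then drop lines that are empty or whitespace-only."""
    text = TAG_RE.sub("", html)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # An empty string is falsy, so whitespace-only lines are skipped.
        if stripped:
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    sample = "<html>\n  <body>\n    <p>Hello</p>\n   \n    <p>World</p>\n  </body>\n</html>"
    print("\n".join(extract_text_lines(sample)))
```

The check `if stripped:` only filters whitespace-only lines. If lines made up solely of punctuation should go too (the stricter "no alphanumeric character" reading), the condition could be `any(ch.isalnum() for ch in line)` instead. A regex like this is only a rough stand-in for a real HTML parser and will misbehave on things like ">" inside attribute values or script blocks.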