I am trying to write an HTML parser, but during testing I do not want to query the website every time, so I saved the page locally as an HTML file.

For reading I use:

urltext = urllib.request.urlopen(urlfile).read().decode("utf-8")

From the website directly I get a correct string to parse, but when I open the copy saved on my local PC it seems to be decoded incorrectly:

<span id="line845"></span>                          </span><span>&lt;<span class="start-tag">h2</span> <span class="attribute-name">class</span>="<a class="attribute-value">article-title</a>"&gt;</span><span>
<span id="line846"></span>                                          </span><span>&lt;<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline-intro</a>"&gt;</span><span>Intro:</span><span>&lt;/<span class="end-tag">span</span>&gt;</span><span> </span><span>&lt;<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline</a>"&gt;</span><span>Main text</span><span>&lt;/<span class="end-tag">span</span>&gt;</span><span></span><span>&lt;/<span class="end-tag">h2</span>&gt;</span><span>

Originally it should look like this:

<h2 class="article-title">
                                            <span class="headline-intro">Intro:</span> <span class="headline">Main Text</span></h2>

Any ideas what I am doing wrong?

Thanks

Kev

  • if you manually open the file in notepad, which version does it look like? – Woodrow Barlow Jul 12 '16 at 18:16
  • In gedit (and I guess also in Notepad) it shows the wrong version. If I open it in Libre Office it looks fine. – Kev Jul 13 '16 at 07:41
  • it sounds like you opened the website's source code then copy-pasted that into libre office, then saved the file as HTML. am i correct? that doesn't work. HTML is a plain-text format, and libre office creates rich-text files (i.e., including font information, text colors, etc.). the weird "extra" stuff you're seeing is that extra rich text formatting. – Woodrow Barlow Jul 13 '16 at 14:42

1 Answer

You downloaded the HTML file incorrectly, but your method of opening it looks correct.
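
As a minimal sketch of that reading step, assuming the local copy is plain UTF-8 HTML: `urlopen` also accepts `file://` URLs, so the same call can serve both the live site and a saved file. The helper name `read_html` and the example path in the comment are hypothetical and used only for illustration.

```python
import urllib.request
from pathlib import Path


def read_html(source: str, encoding: str = "utf-8") -> str:
    """Return the decoded text behind a URL or a local file path."""
    # Assumption for this sketch: anything without a URL scheme is a local path.
    if "://" not in source:
        # as_uri() needs an absolute path, e.g. file:///home/kev/page.html
        source = Path(source).resolve().as_uri()
    with urllib.request.urlopen(source) as response:
        return response.read().decode(encoding)
```

If the string returned for the local file differs from the one returned for the website, the saved file itself is the problem, not the decoding.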

It sounds like you opened the web page's source code in your browser, copy-pasted that into Libre Office, and used Libre Office's "Save as HTML" feature. This won't work, because HTML is a plain-text markup format and Libre Office is a rich-text word processor -- that means Libre Office saves information like font, size, color, decorations, images, etc. right in the file.

The "Save as HTML" feature in Libre Office is meant to convert a normal document into a webpage -- not to save HTML markup that you typed into the document.

In order to download a document the proper way, find your browser's "save" functionality. In most browsers, you can just press Ctrl / Cmd + S. When you're finished, open the file in a plain-text editor (such as Notepad, Gedit, or TextEdit) to be sure it looks as expected.
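
If you would rather create the local copy from Python than from the browser, one possible sketch is to write the response bytes to disk unchanged, so that no editor or word processor gets a chance to rewrite the markup. The function name `save_page`, the URL, and the file name below are placeholders, not taken from the question.

```python
import urllib.request


def save_page(url: str, path: str) -> int:
    """Download url and write the raw, undecoded bytes to path.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # Binary mode keeps the bytes exactly as the server sent them.
    with open(path, "wb") as f:
        return f.write(data)


# Example call (placeholder URL and file name):
# save_page("https://example.com/", "page.html")
```

The saved file can then be read back during testing with the same `urlopen(...).read().decode("utf-8")` call, pointed at a `file://` URL.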

Woodrow Barlow