
I am scraping content off of websites. My Perl script that does the scraping uses the utf8 module. The script works, but one site in particular is giving me a weird issue: a handful of blank spaces come out as a question mark in a diamond, and I'm not sure how to fix it. When I open the saved HTML from the website locally, I see them. Example:

Extreme heat waves have already�resulted in testing sites throughout the country�closing or modifying their schedules.�The heat even damaged 400 tests in Washington, DC, in June. 

Here is the actual page in question that I scraped: https://www.motherjones.com/politics/2020/08/a-hurricane-a-pandemic-and-trump-the-triple-crisis-is-barreling-down-on-florida/

My local web page with the content has the following:

<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"></head>

...snip...

</html>

I'm writing the files like so with Perl:

open my $out, '>',  $path_to_content;
print $out $content;
close $out;

Note that if I change this to:

open my $out, '>:encoding(UTF-8)',  $path_to_content;
print $out $content;
close $out;

the diamond/question-mark character disappears, but a lot of weird characters show up in the output file for this site and others (for example, curly quotes don't render properly).
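
For reference, here is a minimal sketch of the decode-on-input / encode-on-output round trip I understand I should be aiming for. The decode() call, the save_content() name, and the $raw_bytes variable are placeholders, not my actual code, since the real script gets the content from a scraping module:

use strict;
use warnings;
use Encode qw(decode);

# Placeholder: $raw_bytes is whatever undecoded bytes the scraper returned,
# $path_to_content is the output path from the real script.
sub save_content {
    my ($raw_bytes, $path_to_content) = @_;

    # Decode once on input (assuming the site really serves UTF-8)...
    my $content = decode('UTF-8', $raw_bytes);

    # ...manipulate $content here as a character string (tr///, Mojo::DOM, etc.)...

    # ...then encode exactly once on output.
    open my $out, '>:encoding(UTF-8)', $path_to_content
        or die "Can't write $path_to_content: $!";
    print $out $content;
    close $out;
}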

StevieD
    By "My Perl script ... uses the `utf8` module" you mean the `utf8` pragma? Well then this may be of little help -- the [documentation](https://perldoc.perl.org/utf8.html) clearly states "**Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.**". – sticky bit Aug 02 '20 at 03:10
  • 1) The `utf8` pragma doesn't affect your file handles. 2) You have to properly decode the input too (which is my first suspect). 3) Show us a minimal, complete working program so we can see what you are doing. – brian d foy Aug 02 '20 at 03:11
  • [Edit] the question and provide an [example]. Especially a runnable script as a whole, not just snippets out of context. – sticky bit Aug 02 '20 at 03:12
  • It appears this has something to do with the improper handling of the `&nbsp;` HTML entity. I tried using HTML::Entities to decode the scraped content but still no dice. – StevieD Aug 02 '20 at 03:23
  • @stickybit I had to use the utf8 pragma to get the tr/// function to work properly when replacing curly quotes with regular quotes (a minimal sketch of this is below the comments). It didn't work otherwise. – StevieD Aug 02 '20 at 03:27
  • @StevieD: Presumably because the literals in your code needed to be UTF-8 to match. And that doesn't contradict my comment. But tough to tell, still with no [example]... – sticky bit Aug 02 '20 at 03:34
  • This fixed it: https://stackoverflow.com/a/55604005/1641112. Some kind of bug in Mojo::DOM, the module I'm using to manipulate the scraped content. – StevieD Aug 02 '20 at 04:10
  • `open my $out, '>:encoding(UTF-8)', $path_to_content;` means `$content` is expected to be a string of Unicode Code Points. Apparently, it's not, but you haven't provided enough to help us with your problem. – ikegami Aug 02 '20 at 05:00
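
Sketch of the `use utf8;` point discussed in the comments above: the pragma only tells Perl that the source file itself is saved as UTF-8, so curly-quote literals in a tr/// are treated as single characters. The helper name here is hypothetical and assumes the string passed in has already been decoded:

use strict;
use warnings;
use utf8;    # only declares that this source file is saved as UTF-8,
             # so the curly-quote literals below are single characters

# Hypothetical helper: expects an already-decoded character string.
sub straighten_quotes {
    my ($text) = @_;
    $text =~ tr/“”‘’/""''/;   # curly double/single quotes -> straight quotes
    return $text;
}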

1 Answer


This problem looks like the old ISO 8859-1 or Windows CP1252 encoding. The data needs transcoding to UTF-8. If you already have it as a saved file, you can use an online converter; there's no other way. If you are going to save it to a file, you'd better create two files, one as ISO 8859-1 and the other as ANSI (Windows-1252), then convert. Data saved in the wrong encoding is irretrievably broken.
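
A rough Perl sketch of that transcoding step, assuming the saved file really is Windows-1252 (swap in 'iso-8859-1' if that turns out to be the actual encoding); the file names are placeholders:

use strict;
use warnings;
use Encode qw(decode);

# Placeholder paths.
my ($in_path, $out_path) = ('saved_cp1252.html', 'saved_utf8.html');

# Read the raw bytes of the mis-encoded file.
open my $in, '<:raw', $in_path or die "Can't read $in_path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Decode from the assumed legacy encoding, then write back out as UTF-8.
my $text = decode('cp1252', $bytes);

open my $out, '>:encoding(UTF-8)', $out_path or die "Can't write $out_path: $!";
print $out $text;
close $out;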

Edit ---
To be precise: irretrievably for regular human beings.
A line that comes out as '? ? ?' means the text was not in English or German, i.e. the characters fall outside the basic Latin range.

black blue