
I am scraping content off of websites. My Perl script that does the scraping uses the utf8 module. The script works, but one site in particular is giving me a weird issue: a handful of blank spaces come out as a question mark in a diamond, and I'm not sure how to fix it. When I open the saved HTML from the website locally, I see them. Example:

Extreme heat waves have already�resulted in testing sites throughout the country�closing or modifying their schedules.�The heat even damaged 400 tests in Washington, DC, in June. 

Here is the actual page in question that I scraped: https://www.motherjones.com/politics/2020/08/a-hurricane-a-pandemic-and-trump-the-triple-crisis-is-barreling-down-on-florida/

My local web page with the content has the following:

<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"></head>

...snip...

</html>

I'm writing the files like so with Perl:

open my $out, '>',  $path_to_content;
print $out $content;
close $out;

Note that if I change this to:

open my $out, '>:encoding(UTF-8)',  $path_to_content;
print $out $content;
close $out;

the diamond/question-mark character disappears, but a lot of weird characters show up in the output file for this site and others (for example, curly quotes don't render properly).
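
For reference, here is a minimal sketch of the decode-on-input / encode-on-output round trip I understand I should be aiming for. The decode() call, the save_content() name, and the $raw_bytes variable are placeholders, not my actual code, since the real script gets the content from a scraping module:

use strict;
use warnings;
use Encode qw(decode);

# Placeholder: $raw_bytes is whatever undecoded bytes the scraper returned,
# $path_to_content is the output path from the real script.
sub save_content {
    my ($raw_bytes, $path_to_content) = @_;

    # Decode once on input (assuming the site really serves UTF-8)...
    my $content = decode('UTF-8', $raw_bytes);

    # ...manipulate $content here as a character string (tr///, Mojo::DOM, etc.)...

    # ...then encode exactly once on output.
    open my $out, '>:encoding(UTF-8)', $path_to_content
        or die "Can't write $path_to_content: $!";
    print $out $content;
    close $out;
}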

StevieD
    By "My Perl script ... uses the `utf8` module" you mean the `utf8` pragma? Well then this may be of little help -- the [documentation](https://perldoc.perl.org/utf8.html) clearly states "**Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.**". – sticky bit Aug 02 '20 at 03:10
  • 1) The `utf8` pragma doesn't affect your file handles. 2) You have to properly decode the input too (which is my first suspect). 3) Show us a minimal, complete working program so we can see what you are doing. – brian d foy Aug 02 '20 at 03:11
  • [Edit] the question and provide an [example]. Especially a runnable script as a whole, not just snippets out of context. – sticky bit Aug 02 '20 at 03:12
  • It appears this has something to do with the improper handling of the `&nbsp;` HTML entity. I tried using HTML::Entities to decode the scraped content but still no dice. – StevieD Aug 02 '20 at 03:23
  • @stickybit I had to use the utf8 pragma to get the tr/// function to work properly when replacing curly quotes with regular quotes (a minimal sketch of this is below the comments). It didn't work otherwise. – StevieD Aug 02 '20 at 03:27
  • @StevieD: Presumably because the literals in your code needed to be UTF-8 to match. And that doesn't contradict my comment. But tough to tell, still with no [example]... – sticky bit Aug 02 '20 at 03:34
  • This fixed it: https://stackoverflow.com/a/55604005/1641112. Some kind of bug in Mojo::DOM, the module I'm using to manipulate the scraped content. – StevieD Aug 02 '20 at 04:10
  • `open my $out, '>:encoding(UTF-8)', $path_to_content;` means `$content` is expected to be a string of Unicode Code Points. Apparently, it's not, but you haven't provided enough to help us with your problem. – ikegami Aug 02 '20 at 05:00
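
Sketch of the `use utf8;` point discussed in the comments above: the pragma only tells Perl that the source file itself is saved as UTF-8, so curly-quote literals in a tr/// are treated as single characters. The helper name here is hypothetical and assumes the string passed in has already been decoded:

use strict;
use warnings;
use utf8;    # only declares that this source file is saved as UTF-8,
             # so the curly-quote literals below are single characters

# Hypothetical helper: expects an already-decoded character string.
sub straighten_quotes {
    my ($text) = @_;
    $text =~ tr/“”‘’/""''/;   # curly double/single quotes -> straight quotes
    return $text;
}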

1 Answer


This problem looks like the old ISO 8859-1 or Windows CP1252 encoding. The data needs transcoding to UTF-8. If you already have it as a saved file, you can use an online converter; there's no other way. If you are going to save it to a file, you'd better create two files, one as ISO 8859-1 and the other as ANSI (Windows-1252), then convert. Data saved in the wrong encoding is irretrievably broken.
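
A rough Perl sketch of that transcoding step, assuming the saved file really is Windows-1252 (swap in 'iso-8859-1' if that turns out to be the actual encoding); the file names are placeholders:

use strict;
use warnings;
use Encode qw(decode);

# Placeholder paths.
my ($in_path, $out_path) = ('saved_cp1252.html', 'saved_utf8.html');

# Read the raw bytes of the mis-encoded file.
open my $in, '<:raw', $in_path or die "Can't read $in_path: $!";
my $bytes = do { local $/; <$in> };
close $in;

# Decode from the assumed legacy encoding, then write back out as UTF-8.
my $text = decode('cp1252', $bytes);

open my $out, '>:encoding(UTF-8)', $out_path or die "Can't write $out_path: $!";
print $out $text;
close $out;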

Edit ---
To be precise: irretrievably for regular human beings.
A line that comes out as '? ? ?' means the text was not in English or German, i.e. the characters fall outside the basic Latin range.

black blue