I am scraping content from websites with a Perl script that uses the utf8 pragma.
The script works, but on one site in particular a handful of spaces come out as a question mark in a diamond, and I'm not sure how to fix it. I see them when I open my locally saved copy of the page's HTML in a browser. Example:
Extreme heat waves have already�resulted in testing sites throughout the country�closing or modifying their schedules.�The heat even damaged 400 tests in Washington, DC, in June.
Here is the actual page in question that I scraped: https://www.motherjones.com/politics/2020/08/a-hurricane-a-pandemic-and-trump-the-triple-crisis-is-barreling-down-on-florida/
My local copy of the page, which holds the scraped content, has the following markup:
<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"></head>
...snip...
</html>
I write the files with Perl like so:
open my $out, '>', $path_to_content or die "Cannot open $path_to_content: $!";
print $out $content;
close $out;
Note that if I change this to:
open my $out, '>:encoding(UTF-8)', $path_to_content or die "Cannot open $path_to_content: $!";
print $out $content;
close $out;
the question-mark diamonds disappear, but a lot of garbled characters show up in the output files for this site and others (for example, curly quotes no longer render properly).
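To narrow this down, here is a minimal sketch of what I suspect the two symptoms look like at the byte level. It is not my actual scraper: the sample strings are made up, and it rests on my assumption that the odd spaces are no-break spaces (U+00A0) and that some of the content reaches the print statement as undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample: a no-break space (U+00A0), which looks like a plain space.
my $nbsp = "\x{A0}";

# Encoded as UTF-8 it is the two bytes C2 A0. A lone byte A0 is not valid
# UTF-8, which a browser would show as the replacement character.
my $nbsp_hex = unpack 'H*', encode('UTF-8', $nbsp);
print "nbsp as UTF-8: $nbsp_hex\n";            # c2a0

# Made-up sample: the UTF-8 bytes of a left curly quote (U+201C), not yet decoded.
my $quote_bytes = "\xE2\x80\x9C";

# Encoding undecoded bytes again treats each byte as a character (double encoding).
my $double_hex = unpack 'H*', encode('UTF-8', $quote_bytes);
print "double-encoded quote: $double_hex\n";   # c3a2c2809c

# Decoding first, then encoding once on output, round-trips cleanly.
my $quote_char = decode('UTF-8', $quote_bytes);
my $once_hex   = unpack 'H*', encode('UTF-8', $quote_char);
print "decoded then encoded: $once_hex\n";     # e2809c
```

If that assumption holds, the first case would match the diamonds I get without the encoding layer, and the second would match the garbled curly quotes I get with it.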