I'm downloading a webpage from within R using xml2, and then using pandoc to convert it to pdf.

My R code

    library(xml2)
    download_html("https://thehustle.co/apple-christmas-present", "test.html")

Command line:

    pandoc test.html -o converted.pdf

This fails with the error

    pandoc: Cannot decode byte '\xf9': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

I'm not sure what's going on here. If the webpage is not encoded as UTF-8 (and that is the root cause of the error), is there some way to convert it to UTF-8?

Conor Neilson
  • Note that I've also tried to run this conversion using `wkhtmltopdf`, and this failed with `Exit with code 1 due to network error: ProtocolUnknownError`, so probably not just a pandoc error – Conor Neilson Apr 29 '21 at 08:38

1 Answer

There are two things you could try:

  1. Use pandoc directly: instead of downloading the page with a separate tool, let pandoc fetch it.

    pandoc https://thehustle.co/apple-christmas-present -f html -o converted.pdf
    
  2. Use iconv to ensure that the input is really UTF-8 encoded:

    iconv -t utf-8 test.html | pandoc -f html -o converted.pdf
    

If neither works then it's likely to be a problem with the website.
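
One caveat on the iconv route: when `-f` is omitted, iconv takes the source encoding from the current locale, so in a UTF-8 locale it may reject the same byte pandoc did. Here is a minimal sketch of a workaround, assuming the page is actually in a single-byte encoding. The helper name `to_utf8` and the ISO-8859-1 fallback are my own assumptions for illustration; nothing in the error message confirms the page's real encoding, so check its `charset` declaration first.

```shell
# Hypothetical helper: emit a file as UTF-8.
# If the file is already valid UTF-8 it is passed through unchanged;
# otherwise it is re-encoded from an ASSUMED source encoding
# (default ISO-8859-1, override with the second argument).
to_utf8() {
  src="$1"
  from="${2:-ISO-8859-1}"
  # iconv exits non-zero when the input is not valid in the -f encoding,
  # so a UTF-8 to UTF-8 round trip works as a validity check.
  if iconv -f UTF-8 -t UTF-8 "$src" >/dev/null 2>&1; then
    cat "$src"
  else
    iconv -f "$from" -t UTF-8 "$src"
  fi
}

# Usage (assumes test.html exists and pandoc is installed):
#   to_utf8 test.html | pandoc -f html -o converted.pdf
```

If the guessed encoding is wrong, the conversion will still succeed but accented characters will come out garbled, so it is worth spot-checking the output.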

tarleb