I'm writing Unicode strings to HTML files in Python. I use Unicode internally and only encode on output, something like:

with open(filename, 'w') as f:
    f.write(s.encode("utf-8"))

This works just as expected on my local machine. But when it runs on Travis CI, the generated files have Ã¼ in place of ü. Any ideas?

Here is my .travis.yml:

language: python
python: 2.7.10
install: pip install -r requirements.txt
script: python main.py -d
deploy:
  provider: s3
  access_key_id: XXX
  secret_access_key:
    secure: XXX
  bucket: www.my.org
  region: us-east-1
  skip_cleanup: true
  default_text_charset: 'utf-8'
  local-dir: output

Update

The minimal Python code that reproduces the problem is as follows:

from pyquery import PyQuery as pq

argurl = 'http://hackingdistributed.com/tag/bitcoin/'

d = pq(url=argurl)

authors = []
for elem in d.find("h2.post-title a"):
    pubinfo = pq(elem).parent().parent().find(".post-metadata .post-published")
    author = pq(pubinfo).find(".post-authors").html().strip()
    authors.append(author)

with open('output/test.html', 'w') as f:
    f.write(': '.join(authors).encode('utf-8'))

Open output/test.html to see Ã¼ where ü should be.

tambre
qweruiop
  • Could you attach the `.travis-ci.yml` that you are using along with a minimal Python example code? – tambre Aug 25 '16 at 15:32
  • @tambre Please see my edits. – qweruiop Aug 25 '16 at 16:57
  • What Python version are you using locally? – tambre Aug 25 '16 at 17:36
  • @tambre Python 2.7.10 – qweruiop Aug 25 '16 at 17:37
  • I suspect this is rather a bug in your Python code. Would you mind posting the minimal amount of Python code that reproduces this problem on Travis? – tambre Aug 25 '16 at 17:43
  • @tambre Hi, tambre, I just uploaded a script that can reproduce the problem. – qweruiop Aug 25 '16 at 20:31
  • I'm unable to reproduce it on 64-bit Python 2.7.12 using PyQuery 1.2.13 on Windows 10 Pro. All the ü's are displayed correctly. Make sure that your text editor supports UTF-8 and that the file is correctly decoded as UTF-8 with the editor/browser you're displaying it as. – tambre Aug 26 '16 at 07:55
  • "The way I do it is to use Unicode internally" - are you sure? I'd assume that pyquery in 2.7 uses the standard urllib libraries to read text which then wouldn't be unicode. So whether this works would depend on the host encoding. Apart from that your "html file" isn't valid html and doesn't contain encoding info so you're at the mercy of your editor. – Voo Aug 26 '16 at 07:57
  • How are you reading your HTML file? The characters `Ã¼` are the Latin-1 decoding of the UTF-8 encoding of `ü`, so I suspect you're simply reading it with the wrong encoding after writing it correctly. – Blckknght Aug 26 '16 at 07:58
  • Finally thought about viewing it in the browser, was able to reproduce using Chrome. Adding BOM seems to fix the problem in Chrome. – tambre Aug 26 '16 at 08:50
  • @Blckknght I'm reading it through the browser with UTF-8. See OP. – qweruiop Aug 26 '16 at 12:41
  • @tambre I can't reproduce it on my platform either unless I set the browser to use a Latin charset. Things **only** went wrong on **travis**. – qweruiop Aug 26 '16 at 12:44
  • @qweruiop Does adding the BOM help as per my answer? – tambre Aug 26 '16 at 12:55
  • @Voo That might be a good point. `print type(author)` gives `<type 'unicode'>`. Doesn't this mean the result from `pq` is Unicode? Also, the problem remains even when `` is added to the HTML (which is left out of this minimal example.) – qweruiop Aug 26 '16 at 12:58
  • @Voo I now suspect `urllib` might be the problem. It seems like it's reading the original website as `Latin-1` and returns a UTF-8 encoded 'Latin-1' string. Will Python 3 make this better? Do you have an idea of how to set urllib right? – qweruiop Aug 26 '16 at 13:13
  • @qweruiop Trying your sample code with python 2.7 under Windows 10, the ü is correctly written as UTF-8. I'm pretty sure at this point that the editor you're using to display the file is using the wrong encoding. Try [notepad++](https://notepad-plus-plus.org/download/v6.9.2.html) if you aren't already. – Voo Aug 26 '16 at 17:42
  • @Voo Everything works correctly on **my** laptop. The problem is with **travis**. – qweruiop Aug 26 '16 at 17:59
  • @qweruiop So you're opening the files with the same editor in both situations? Interesting. I would try reading the webpage directly with urllib as bytes and then decoding explicitly with UTF8. You can then pass the string to pyquery and also try and see what happens if you save the content directly to file. Might be able to pinpoint the problem this way. Also I'd usually use `open(.., encoding='UTF8')` and then save the string directly without encode. The way you're doing it now, you should open it in binary mode I think (although not sure how that would matter). – Voo Aug 26 '16 at 18:53
  • @Voo can you write a short demo using urllib? I didn't find how to get bytes. – qweruiop Aug 26 '16 at 19:10
  • @Voo I used `requests` instead which is more convenient in Python2. Do you want to convert your comment about explicit decoding to an answer? – qweruiop Aug 26 '16 at 20:20
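
The comments above make two points that a short sketch may help illustrate: `Ã¼` is what the UTF-8 bytes of `ü` look like when they are decoded as Latin-1, and decoding the fetched bytes explicitly avoids relying on a guessed encoding. This is a minimal sketch, not the asker's actual fix. `decode_page` and `write_html` are hypothetical helper names, and it assumes the page really is served as UTF-8. With `requests`, the raw bytes would come from `response.content`.

```python
# -*- coding: utf-8 -*-
import io

correct = u'\u00fc'                       # the character ü
utf8_bytes = correct.encode('utf-8')      # two bytes: 0xC3 0xBC

# Decoding those UTF-8 bytes as Latin-1 yields the two characters "Ã¼".
mojibake = utf8_bytes.decode('latin-1')

# If text was wrongly decoded as Latin-1, reversing the step recovers it.
repaired = mojibake.encode('latin-1').decode('utf-8')


def decode_page(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode fetched bytes with an explicit encoding.

    Assumes the server really sends UTF-8; pass a different encoding if not.
    """
    return raw_bytes.decode(encoding)


def write_html(path, text):
    """Hypothetical helper: write unicode text, letting io.open encode it."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
```

Decoding once at the boundary and writing through `io.open` keeps every intermediate value a unicode string, so the platform's default encoding should not affect the result.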

1 Answer

This is most likely because your browser is decoding the file with the wrong encoding. The easiest fix is to write the file as UTF-8 with a BOM by adding the byte order mark to the start of the file.

Here's the fixed code for writing to the file:

# Open in binary mode, since encoded bytes are being written.
with open('output/test.html', 'wb') as f:
    f.write(u'\ufeff'.encode('utf-8'))  # byte order mark (BOM)
    f.write(': '.join(authors).encode('utf-8'))
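
As an alternative sketch of the same idea, Python's `utf-8-sig` codec writes the BOM automatically, so the file can be opened in text mode and given unicode strings directly. `write_html_with_bom` is a hypothetical helper name, and whether the BOM actually resolves the Travis problem is an assumption this sketch does not verify.

```python
import io


def write_html_with_bom(path, text):
    """Hypothetical helper: write unicode text as UTF-8 preceded by a BOM."""
    # 'utf-8-sig' encodes as UTF-8 and prepends the bytes EF BB BF once,
    # at the start of the file.
    with io.open(path, 'w', encoding='utf-8-sig') as f:
        f.write(text)
```

For the example above it might be called as `write_html_with_bom('output/test.html', u': '.join(authors))`, assuming `authors` holds unicode strings.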
tambre