Travis CI encodes ü as Ã¼

Question

I'm writing some Unicode strings to HTML in Python. My approach is to use Unicode internally and encode only on output, something like:

with open(filename, 'wb') as f:  # binary mode, since encoded bytes are written
    f.write(s.encode("utf-8"))

This works as expected on my local machine. But when it runs on Travis CI, the generated files have Ã¼ in place of ü. Any ideas?

Here is my .travis.yml:

language: python
python: 2.7.10
install: pip install -r requirements.txt
script: python main.py -d
deploy:
  provider: s3
  access_key_id: XXX
  secret_access_key:
    secure: XXX
  bucket: www.my.org
  region: us-east-1
  skip_cleanup: true
  default_text_charset: 'utf-8'
  local-dir: output

Update

The minimal Python code that reproduces the problem is as follows:

from pyquery import PyQuery as pq

argurl = 'http://hackingdistributed.com/tag/bitcoin/'

d = pq(url=argurl)

authors = []
for elem in d.find("h2.post-title a"):
    pubinfo = pq(elem).parent().parent().find(".post-metadata .post-published")
    author = pq(pubinfo).find(".post-authors").html().strip()
    authors.append(author)

with open('output/test.html', 'wb') as f:  # binary mode, since encoded bytes are written
    f.write(': '.join(authors).encode('utf-8'))

Open output/test.html to see the Ã¼.

Could you attach the `.travis-ci.yml` that you are using along with a minimal Python example code? — tambre, Aug 25 '16 at 15:32
I suspect this is rather a bug in your Python code. Would you mind posting the minimal amount of Python code that reproduces this problem on Travis? — tambre, Aug 25 '16 at 17:43
@tambre Hi, tambre, I just uploaded a script that can reproduce the problem. — qweruiop, Aug 25 '16 at 20:31
I'm unable to reproduce it on 64-bit Python 2.7.12 using PyQuery 1.2.13 on Windows 10 Pro. All the ü's are displayed correctly. Make sure that your text editor supports UTF-8 and that the file is correctly decoded as UTF-8 with the editor/browser you're displaying it as. — tambre, Aug 26 '16 at 07:55
"The way I do it is to use Unicode internally" - are you sure? I'd assume that pyquery in 2.7 uses the standard urllib libraries to read text, which then wouldn't be unicode. So whether this works would depend on the host encoding. Apart from that, your "html file" isn't valid html and doesn't contain encoding info, so you're at the mercy of your editor. — Voo, Aug 26 '16 at 07:57
How are you reading your HTML file? The characters `Ã¼` are the Latin-1 decoding of the UTF-8 encoding of `ü`, so I suspect you're simply reading it with the wrong encoding after writing it correctly. — Blckknght, Aug 26 '16 at 07:58
Finally thought about viewing it in the browser, was able to reproduce using Chrome. Adding BOM seems to fix the problem in Chrome. — tambre, Aug 26 '16 at 08:50
@Blckknght I'm reading it through the browser with UTF-8. See OP. — qweruiop, Aug 26 '16 at 12:41
@tambre I can't reproduce it on my platform either unless I set the browser to use a Latin charset. Things **only** went wrong on **travis**. — qweruiop, Aug 26 '16 at 12:44
@Voo That might be a good point. `print type(author)` gives ``. Doesn't this mean the result from `pq` is Unicode? Also, the problem remains even `` is added to HTML (which is left out from this minimal example.) — qweruiop, Aug 26 '16 at 12:58
@Voo I now suspect `urllib` might be the problem. It seems like it's reading the original website as `Latin-1` and returning a UTF-8 encoded 'Latin-1' string. Will Python 3 make this better? Do you have an idea of how to set urllib right? — qweruiop, Aug 26 '16 at 13:13
@qweruiop Trying your sample code with python 2.7 under Windows 10, the ü is correctly written as UTF-8. I'm pretty sure at this point that the editor you're using to display the file is using the wrong encoding. Try [notepad++](https://notepad-plus-plus.org/download/v6.9.2.html) if you aren't already. — Voo, Aug 26 '16 at 17:42
@Voo Everything works correctly on **my** laptop. The problem is with **travis**. — qweruiop, Aug 26 '16 at 17:59
@qweruiop So you're opening the files with the same editor in both situations? Interesting. I would try reading the webpage directly with urllib as bytes and then decoding explicitly with UTF8. You can then pass the string to pyquery and also try and see what happens if you save the content directly to file. Might be able to pinpoint the problem this way. Also I'd usually use `open(.., encoding='UTF8')` and then save the string directly without encode. The way you're doing it now, you should open it in binary mode I think (although not sure how that would matter). — Voo, Aug 26 '16 at 18:53
@Voo can you write a short demo using urllib? I didn't find how to get bytes. — qweruiop, Aug 26 '16 at 19:10
@Voo I used `requests` instead which is more convenient in Python2. Do you want to convert your comment about explicit decoding to an answer? — qweruiop, Aug 26 '16 at 20:20
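
The relationship between ü and Ã¼ described in the comments above can be checked directly. What follows is only a minimal sketch, written to run on both Python 2.7 and Python 3; the variable names are illustrative and it does not claim to show what pyquery or Travis actually did. It shows that decoding the UTF-8 bytes of ü as Latin-1 gives the two characters Ã¼, and that encoding those two characters as UTF-8 again writes the error into the file.

```python
# The character from the question, written as an escape so the
# source file's own encoding does not matter.
original = u'\u00fc'                      # ü

# Correct UTF-8 encoding: two bytes, 0xC3 0xBC.
utf8_bytes = original.encode('utf-8')

# Decoding those bytes with the wrong codec (Latin-1) yields two
# characters, U+00C3 and U+00BC, which display as the mojibake "Ã¼".
misread = utf8_bytes.decode('latin-1')

# Encoding the mis-decoded text as UTF-8 again produces four bytes,
# so the file is now wrong even for a reader that uses UTF-8.
double_encoded = misread.encode('utf-8')

# The damage is reversible as long as both steps are known:
# undo the second encode, then redo the first decode correctly.
repaired = double_encoded.decode('utf-8').encode('latin-1').decode('utf-8')
```

If the file on disk contains the two-byte form, only the reader's charset is wrong. If it contains the four-byte form, the text was mis-decoded before it was written.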

score -1 · Accepted Answer · answered Aug 26 '16 at 08:49

This is most likely because your browser is decoding the file with the wrong charset. The easiest fix is to write the file as UTF-8 with a BOM, i.e. to add the byte order mark at the start of the file.

Here's the fixed code for writing to the file:

with open('output/test.html', 'wb') as f:  # binary mode, since encoded bytes are written
    f.write(u'\ufeff'.encode('utf-8'))  # UTF-8 byte order mark (BOM)
    f.write(': '.join(authors).encode('utf-8'))
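
The same BOM can also be produced by Python's standard `utf-8-sig` codec instead of writing `u'\ufeff'` by hand. This is a minimal sketch under a few assumptions: the helper name `write_with_bom`, the temporary path and the placeholder author name are illustrative and not from the original post. It is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def write_with_bom(path, text):
    """Write unicode text as UTF-8, preceded by a byte order mark."""
    # The 'utf-8-sig' codec emits the BOM bytes EF BB BF before the
    # first character, so no manual u'\ufeff' write is needed.
    with io.open(path, 'w', encoding='utf-8-sig') as f:
        f.write(text)


# Placeholder data and location, chosen only for this demonstration.
bom_path = os.path.join(tempfile.mkdtemp(), 'test.html')
write_with_bom(bom_path, u'M\u00fcller')

# Read the file back as raw bytes to inspect what was written.
with open(bom_path, 'rb') as f:
    raw = f.read()
```

Using `io.open` with an `encoding` argument also removes the separate `.encode('utf-8')` call, since the file object does the encoding.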

tambre

Not really. As I said in the comments, I now suspect `urllib` might be the problem. It seems like it's reading the original website as `Latin-1` and returning a UTF-8 encoded 'Latin-1' string. Any idea? – qweruiop Aug 26 '16 at 13:15
@qweruiop You don't use `urllib` anywhere in your code...? – tambre Aug 26 '16 at 13:35
pyquery uses `urllib`. – qweruiop Aug 26 '16 at 14:37
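
The alternative raised in the question's comments is to take the page as raw bytes, decode it explicitly as UTF-8, and write it through an encoding-aware `open`. A rough sketch of that idea follows. The helper names `decode_body` and `write_html` are hypothetical, and the bytes are a hard-coded stand-in for a real response (with `requests`, they would come from the `.content` attribute of the response), so this shows only the decode and write steps, not the fix the asker actually deployed. It is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def decode_body(raw_bytes, encoding='utf-8'):
    """Decode raw response bytes with an explicitly chosen codec."""
    # Naming the codec here avoids relying on whatever default the
    # HTTP library or the host environment would otherwise pick.
    return raw_bytes.decode(encoding)


def write_html(path, text):
    """Write unicode text to disk as UTF-8 without a manual encode()."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Stand-in for bytes fetched over HTTP; the markup is a placeholder.
raw_body = u'<span class="post-authors">M\u00fcller</span>'.encode('utf-8')

html = decode_body(raw_body)

out_path = os.path.join(tempfile.mkdtemp(), 'test.html')
write_html(out_path, html)

# Confirm the bytes on disk match the original UTF-8 bytes.
with open(out_path, 'rb') as f:
    written = f.read()
```

The decoded `html` string could then be handed to pyquery, so that no library has to guess the charset of the page.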
