I read various lines from a CSV file like this:

f1 = open(current_csv, 'rb')
table = f1.readlines()
f1.close()

So essentially any single line in table is something like this:

line = b' G\xe4rmanword:           123,45\r\n'

which type() tells me is bytes, but I need to work on it with .replace, so I'm turning it into a string with line = str(line). But now line has turned into

"b' G\\xe4rmanword:           123,45\\r\\n'"

with an added \ before every \. With print(line) they don't show up, but if I want to turn \xe4 into ae (an alternative way of writing ä) with line = line.replace('\xe4', 'ae'), this does nothing. Using '\\xe4' works, however. I would have expected the first one to turn \\xe4 into \ae instead of doing nothing, and the second option, while it works, relies on my defining a new replacement pattern for ä. I'd rather avoid both.

So I'm trying to understand where the extra backslash comes from and how I can avoid it in the first place, instead of having to fix it in my postprocessing. I have the feeling that something changed between Python 2 and 3, since the original CSV reader is a Python 2 script I had translated with 2to3.

JC_CL
  • Don't do this: `line = str(line)`. You want to *decode* the `bytes` object into a `str` object; passing it to the `str` constructor just gives you the *string representation of the bytes object*, which is not what you want. You should probably just open the file in text mode, so `f1 = open(current_csv, 'r')`, with `'r'` instead of `'rb'` – juanpa.arrivillaga Feb 13 '20 at 10:06
  • Yes, there are changes between Python 2 and 3. You can read about those related to strings here: https://medium.com/better-programming/strings-unicode-and-bytes-in-python-3-everything-you-always-wanted-to-know-27dc02ff2686 – YamiOmar88 Feb 13 '20 at 10:08
  • @L3viathan yep, my bad – juanpa.arrivillaga Feb 13 '20 at 10:08
  • I think I actively chose `rb` since `r` can't deal with umlauts like ä, which I'm dealing with later on (which now fails). – JC_CL Feb 13 '20 at 10:19
  • @JC_CL um, yes, `'r'` can deal with that just fine, you just need to give it the correct encoding, so `f1 = open(current_csv, 'r', encoding='latin1')` – juanpa.arrivillaga Feb 13 '20 at 10:26
  • You're right. That's probably the correct way to do it, but for working with the historical thing, the answer also works. – JC_CL Feb 13 '20 at 11:17

1 Answer

Yes, since Python 3 uses Unicode for all strings, the semantics of many string-related functions, including str, have changed compared to Python 2. In this particular case, you need to pass a second argument to str giving the encoding of your input bytes value (which, judging from the use of German language, is 'latin1'):

unicode_string = str(line, 'latin1')

Alternatively you can do the same using

unicode_string = line.decode('latin1')
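To make the difference concrete, here is a minimal sketch using the sample line from the question. It assumes the bytes really are 'latin1'; the names as_repr, decoded and cleaned are only illustrative.

```python
# Sample line taken from the question.
line = b' G\xe4rmanword:           123,45\r\n'

# str() on bytes without an encoding returns the repr of the object:
# the b'...' wrapper is part of the text and \xe4 becomes four literal characters.
as_repr = str(line)

# Decoding interprets the bytes instead ('latin1' is an assumption about the file).
decoded = line.decode('latin1').rstrip()

# In a str literal, '\xe4' is the single character 'ä', so replace() now matches.
cleaned = decoded.replace('\xe4', 'ae')

print(as_repr)
print(decoded)
print(cleaned)
```

This is also why replace('\xe4', 'ae') did nothing on the str(line) result: that string contains a backslash followed by "xe4", not the character ä.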

You'd probably also want the trailing \r\n removed, so add .rstrip() to the decoded string. Besides, a more elegant solution for reading the file is:

with open(current_csv, 'rb') as f1:
    table = f1.readlines()

(so no need for close())
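As suggested in the comments on the question, opening the file in text mode with an explicit encoding avoids the manual decode step altogether. The following is a sketch under the assumption that the file is 'latin1'; the temporary file only stands in for current_csv so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file standing in for current_csv:
# latin1-encoded bytes with a CRLF line terminator.
fd, current_csv = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b' G\xe4rmanword:           123,45\r\n')

# Text mode with an explicit encoding yields str lines directly.
with open(current_csv, 'r', encoding='latin1') as f1:
    table = [line.rstrip() for line in f1]

os.remove(current_csv)
print(table)
```

Every element of table is already a str here, so .replace('ä', 'ae') can be applied without any further conversion.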

Błotosmętek
  • "(which, judging from the use of German language, is 'latin1')" that's not really a reasonable inference. utf8 etc handles German language characters just fine, it's simply a different encoding. – juanpa.arrivillaga Feb 13 '20 at 10:30
  • @juanpa.arrivillaga it is reasonable. The charset in question obviously is **not** UTF-8 nor any other kind of Unicode encoding, it is one of the legacy 8-bit encodings. And for German the most likely 8-bit encoding is 'latin1' (ISO/IEC 8859-1) or possibly 'cp1252' a.k.a. 'windows-1252', but most certainly not, for example, 'latin2' – Błotosmętek Feb 13 '20 at 10:47
  • Sure, my point was that simply "German => latin1" isn't a good inference. – juanpa.arrivillaga Feb 13 '20 at 10:50
  • Thanks, that works. In my case, according to `file`, it's `ISO-8859 text, with CRLF line terminators`, which isn't really clear, but `cp1252` works for me. Also, Yamila Omar's link in the comments on the question was very helpful in understanding what happened. – JC_CL Feb 13 '20 at 11:20
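For the encodings discussed in these comments, a small sketch may help show why both 'latin1' and 'cp1252' work for ä while UTF-8 does not. The byte string sample is made up for illustration.

```python
sample = b'G\xe4rman'

# Both legacy 8-bit encodings map byte 0xE4 to 'ä'.
same = sample.decode('latin1') == sample.decode('cp1252')

# They differ in the 0x80-0x9F range: cp1252 puts printable characters there,
# while latin1 maps those bytes to C1 control characters.
x80_cp1252 = b'\x80'.decode('cp1252')  # the euro sign
x80_latin1 = b'\x80'.decode('latin1')  # U+0080, a control character

# A lone 0xE4 followed by an ASCII letter is not valid UTF-8.
try:
    sample.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(same, repr(x80_cp1252), repr(x80_latin1), utf8_ok)
```

So for text that only uses characters such as ä, ö, ü and ß the two encodings are interchangeable, and the choice matters only if bytes in the 0x80-0x9F range occur in the file.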