Python umlaut character issue - mbcs needed, is there a better way for all characters?

Question

I was having trouble with a Python script opening a file that contained an umlaut character. Naturally I thought I could correct this with a Unicode UTF-8 fix, but not so...

I ended up using mbcs (the default is cp1252).

Then I wrote this function, which I would like to make MUCH cleaner:

def len(fname):
    i = -1
    try:
        with open(fname, encoding='mbcs') as f:
            for i, l in enumerate(f):
                pass
    except UnicodeDecodeError:
        try:
            i = -1
            with open(fname, encoding='utf8') as f:
                for i, l in enumerate(f):
                    pass
        except UnicodeDecodeError:
            i = -1
            with open(fname) as f:
                for i, l in enumerate(f):
                    pass
    return i + 2 # 2 because it starts at -1 not 0

What encoding does your file have? Maybe you should simply use that one. — Hyperboreus, Nov 05 '13 at 19:32
Seriously, @Hyperboreus is definitely right. Figure out the file's encoding. — Paulo Bu, Nov 05 '13 at 19:36
Also this python [unicode page](http://docs.python.org/2/howto/unicode.html) is a great primer in file encoding. — RickyA, Nov 05 '13 at 19:41
And btw "mbcs" is not an encoding, but stands for "multi-byte character set". From the docs: "on Windows, Python uses the name “mbcs” to refer to whatever the currently configured encoding is" — Hyperboreus, Nov 05 '13 at 19:41
@Hyperboreus: Or, rather, `"mbcs"` is an encoding, but one that's not known until runtime, and is in fact _never_ known by the programmer or the script, only by Windows itself… — abarnert, Nov 05 '13 at 19:47
@abarnert The idea that there is wisdom unbeknownst to man, and only known by Windows itself, scares me. — Hyperboreus, Nov 05 '13 at 19:48
@Hyperboreus: This is one of those cases where knowledge does not necessarily imply wisdom. :) — abarnert, Nov 05 '13 at 19:50
Data is originally in Teradata Warehouse, sent to SQL Server stored in varchar — Tom Stickel, Nov 05 '13 at 21:11
@TomStickel And what is the encoding of that varchar column? — Hyperboreus, Nov 06 '13 at 00:01

abarnert · Accepted Answer · 2013-11-05T19:56:17.767

You're almost certainly going about this all wrong, as explained in the comments… but if you really do need to do something like this, here's how to simplify it:

The general solution to avoid repeating yourself is to use a loop. You've got the same code three times, with the only difference being the encoding, so loop over three encodings instead. (In your case, the third loop didn't pass an encoding at all, so you do have to know the default value of the parameter, but the docs or help will tell you that.) The only wrinkle is that you apparently don't want to handle exceptions in the third case; the easiest way to do that is to reraise the last exception if they all fail.

While we're at it: There's no need to "declare" i up-front the way you do; the for loop is just going to start at 0 and overwrite whatever you put there (the initial value only matters for an empty file). That also means the +2 at the end is wrong: the last index is one less than the line count, so it should be +1. But there's an easier way to get the length of an iterable in the first place: just feed a generator expression into something that consumes iterables. A custom ilen function written in C would be ideal, but people have tested various different Python implementations, and sum(1 for _ in iterable) is almost as fast as the perfect solution, and dead simple, so it's the most common idiom. If this isn't obvious to you, factor it out as a function and call it ilen, and give it a nice docstring and/or comment. Or just pip install more-itertools and then you can just call more_itertools.ilen(f).
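A minimal sketch of the counting idiom described above, factored out into a helper. The name ilen follows the suggestion in the paragraph; the docstring and the sample inputs are illustrative assumptions.

```python
def ilen(iterable):
    """Return the number of items in an iterable, consuming it in the process."""
    return sum(1 for _ in iterable)


# Works on any iterable, including generators that do not support len().
print(ilen(x for x in range(10) if x % 2 == 0))
print(ilen([]))
```

Because a file object iterates over its lines, ilen(f) on an open text file gives the line count directly.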

Anyway, putting it all together:

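def len(fname):
    last_error = None
    for encoding in 'mbcs', 'utf8', None:
        try:
            with open(fname, encoding=encoding) as f:
                return sum(1 for line in f)
        except UnicodeDecodeError as e:
            # Python 3 unbinds e when the except block ends, so keep a reference
            last_error = e
    raise last_error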

I had to make a small modification since the raise fails due to the e variable losing scope. Thus def len(fname): encoding = ['mbcs', 'utf8', None] for enc in encoding: try: with open(fname, encoding=enc) as f: return sum(1 for line in f) except UnicodeDecodeError as e: if enc == encoding[-1]: raise e else: pass — Tom Stickel, Nov 05 '13 at 22:04
@TomStickel: If you're going to do it that way, just use `raise` inside the `except`, not `raise e`. It's a bit hard to explain why in a comment, but the short version is: if you can re-raise from an except, you should, especially in Python 3 but even in Python 2. — abarnert, Nov 05 '13 at 22:53
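The variant discussed in the two comments above could be sketched as follows. This is a hedged illustration, not either commenter's exact code: the name count_lines and the encodings parameter are assumptions introduced here (the new name also avoids shadowing the built-in len), and the mbcs codec only exists on Windows, so other platforms need to pass their own candidate encodings.

```python
def count_lines(fname, encodings=("mbcs", "utf8", None)):
    """Count the lines in fname, trying each candidate encoding in turn.

    Hypothetical helper: the default tuple mirrors the thread and only
    works on Windows, where the "mbcs" codec is available.
    """
    for position, encoding in enumerate(encodings):
        try:
            with open(fname, encoding=encoding) as f:
                return sum(1 for _ in f)
        except UnicodeDecodeError:
            # On the last candidate, re-raise from inside the except block.
            # A bare raise keeps the original exception and its traceback.
            if position == len(encodings) - 1:
                raise
```

Re-raising inside the except block sidesteps the scoping problem entirely, because the exception never has to be referenced by name after the block ends.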

Robert Siemer · Answer 2 · 2013-11-08T08:05:07.543

It’s not entirely clear to me what you want. If you just want to count the lines, ignore the errors! This is pretty safe, as practically all encodings use the same ASCII-compatible line endings (except UTF-16...).

open(fname, errors='ignore')

And you never get an exception. Done.
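As a rough sketch, a complete line counter built on that call might look like this. The function name is a hypothetical choice for illustration; the rest is the standard open() call shown above.

```python
def count_lines_ignoring_errors(fname):
    """Count lines without ever raising UnicodeDecodeError (hypothetical name)."""
    # errors="ignore" silently drops bytes that cannot be decoded.
    # A missing file still raises FileNotFoundError, as it would without it.
    with open(fname, errors="ignore") as f:
        return sum(1 for _ in f)
```

Note that only decoding errors are suppressed; problems such as a missing file or insufficient permissions still surface as exceptions.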

edited Nov 08 '13 at 08:05

answered Nov 05 '13 at 20:41


I still need it to fail for example if a file does not exist. Or if someone tries to transmit a binary file. – Tom Stickel Nov 05 '13 at 21:14
@TomStickel: What exactly is a "binary file"? That's not a rhetorical question; your existing code (and my answer) says that anything that isn't valid as a text file in the current Windows MBCS codepage, or UTF-8, or the OEM codepage is binary. That means that, e.g., UTF-16 text files (which are pretty common on Windows) will be treated as binary. On the other hand, if the OEM codepage is one of the extended-Latin ones, almost _nothing_ will be treated as binary. – abarnert Nov 05 '13 at 22:51
@TomStickel: 1) It does fail if the file does not exist, of course! 2) So you want to use 3 character encodings to test if it’s a text file or not?? – Robert Siemer Nov 08 '13 at 08:12
