C# / Python Encoding difference

Question

Basically, I am converting PDFs into text, then analyzing and clipping parts of that text using a library in Python. The Python "clipping" doesn't actually cut the text into separate files; it just records a start and end character position for string extraction. For example:

the quick brown fox jumped over the lazy dog

My Python code might cut out "quick" by specifying 4, 9. I then use C# for a GUI program and try to use these values assigned by Python, and it works... for the most part. It appears that the optical character recognition program that turned the PDF into a text file included some odd Unicode characters, which change the counts on the C# side.

The odd characters from the PDF-to-text conversion include a "ﬁ" character instead of separate "f" and "i" characters (possibly other characters too; they are large files). Now this wouldn't be a problem, except that C# says this is one character, while Python (as well as Notepad++) considers it 3 character positions.

C#: "ﬁ" length = 1 character.

Python/Notepad++: "ﬁ" length = 3 characters.

What this ends up doing is giving me an offset clip due to the difference in character counts. Like I said, when I run it in Python (Linux) and output the clipping, it's perfect; I then transferred the text file to Windows, and Notepad++ confirms the positions are correct. C# really just counts the "ﬁ" as one character, while Notepad++ and Python count it as 3 characters for some reason.

I need a way to bridge this discrepancy from the Python side OR the C# side.

Daniel · Accepted Answer · 2015-07-18T13:39:53.787

You have to distinguish between characters and bytes. UTF-8 is a character encoding in which one character can take up to 4 bytes; the "ﬁ" ligature (U+FB01) takes 3. So Notepad++ is probably displaying byte positions, whereas Python can work with both byte strings and character strings. In C# you have probably read the file as a text file, which also produces character strings.
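
To see the difference concretely, here is a small Python 3 sketch. The sample sentence is made up for illustration, and it assumes the text is UTF-8 encoded:

```python
# "\ufb01" is the "ﬁ" ligature; the sample sentence is hypothetical.
text = "the \ufb01rst quick fox"      # character string
data = text.encode("utf-8")           # the same text as UTF-8 bytes

print(len("\ufb01"))                  # 1 character
print(len("\ufb01".encode("utf-8")))  # 3 bytes

# The same word is found at different positions depending on the unit.
char_pos = text.find("quick")         # counted in characters
byte_pos = data.find(b"quick")        # counted in bytes
print(char_pos, byte_pos)             # 9 11
```

Every "ﬁ" before the clip shifts the byte position by two relative to the character position, which would match the kind of offset drift described in the question.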

To read character strings in Python, use:

import codecs

# Decode the file as UTF-8 so that text is a character string, not bytes.
with codecs.open(filename, encoding="utf-8") as inp:
    text = inp.read()
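
If the positions have already been computed as byte offsets, they could also be converted afterwards rather than recomputed. The following is only a sketch under the assumption that the offsets are UTF-8 byte positions that fall on character boundaries; the helper names are hypothetical. Since .NET string indices count UTF-16 code units, the second helper is the one that would line up with C# for characters outside the Basic Multilingual Plane:

```python
def byte_to_char_offset(data, byte_offset):
    """Count the characters in the first byte_offset bytes of UTF-8 data."""
    # Raises UnicodeDecodeError if byte_offset splits a multi-byte character.
    return len(data[:byte_offset].decode("utf-8"))


def byte_to_utf16_offset(data, byte_offset):
    """Count the UTF-16 code units in the same prefix."""
    prefix = data[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


# Hypothetical sample: "\ufb01" is the "ﬁ" ligature (3 bytes in UTF-8).
data = "the \ufb01rst quick fox".encode("utf-8")
start = data.find(b"quick")               # byte offset of the clip start
end = start + len(b"quick")               # byte offset of the clip end

char_start = byte_to_char_offset(data, start)
char_end = byte_to_char_offset(data, end)
print(start, end)                         # 11 16
print(char_start, char_end)               # 9 14
print(data.decode("utf-8")[char_start:char_end])  # quick
```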

edited Jul 18 '15 at 13:39

answered Jul 18 '15 at 12:59

In Python I'm using f=open(file) str = f.read() then str.find() to get the locations; won't that be a character string? – projectgonewrong Jul 18 '15 at 13:07
