I am converting PDFs to text, then analyzing and clipping parts of that text using a Python library. The Python "clipping" doesn't actually cut the text into separate files; it just records a start and an end character position for string extraction. For example:
the quick brown fox jumped over the lazy dog
My Python code might cut out "quick" by specifying 4, 9. I am then using C# for a GUI program that takes the values assigned by Python, and it works... for the most part. It appears the optical character recognition program that turned the PDF into a text file included some odd Unicode characters, which change the counts on the C# side.
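For reference, the clipping on the Python side amounts to an ordinary slice. A minimal sketch with the example sentence (the variable names are only illustrative, not taken from the library I use):

```python
text = "the quick brown fox jumped over the lazy dog"

# The start position is inclusive and the end position is exclusive.
start, end = 4, 9
clip = text[start:end]
print(clip)  # quick
```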
The odd characters from the PDF-to-text conversion include a single "ﬁ" ligature character in place of a separate "f" and "i" (possibly other characters too; these are large files). This wouldn't be a problem, except that C# says this is one character, while Python (as well as Notepad++) considers it 3 character positions.
C#: "ﬁ" length = 1 character.
Python/Notepad++: "ﬁ" length = 3 characters.
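The counts above can be reproduced in a small sketch, assuming the character is the ligature U+FB01 and the text file is UTF-8 encoded (both are my assumptions). If so, the "3" would be a count of UTF-8 bytes rather than characters, which is what Python reports when it works on raw bytes, and the "1" would be a count of UTF-16 code units, which is what C#'s string Length reports:

```python
# Assumed character: U+FB01, LATIN SMALL LIGATURE FI.
ligature = "\ufb01"

code_points = len(ligature)                           # Unicode code points
utf8_bytes = len(ligature.encode("utf-8"))            # bytes in a UTF-8 file
utf16_units = len(ligature.encode("utf-16-le")) // 2  # UTF-16 code units

print(code_points, utf8_bytes, utf16_units)  # 1 3 1
```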
The result is that my clips are offset because of the difference in character counts. As I said, when I run it in Python (on Linux) and output the clipping, it's perfect, and after transferring the text file to Windows, Notepad++ confirms those are the correct positions. C# counts the "ﬁ" ligature as one character, while Notepad++ and Python count it as 3 for some reason.
I need a way to bridge this discrepancy from the Python side OR the C# side.
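One possible bridge on the Python side, sketched under the assumption that my positions are really UTF-8 byte offsets and that they always fall on character boundaries: convert each byte span into a span of UTF-16 code units, which is how C# indexes strings. The function name `byte_span_to_utf16_span` is hypothetical, and I have not confirmed that this matches what the library does internally.

```python
def byte_span_to_utf16_span(data, start, end, encoding="utf-8"):
    """Convert a (start, end) byte span in `data` to UTF-16 code unit offsets.

    Assumes `start` and `end` fall on character boundaries; otherwise
    decoding raises UnicodeDecodeError.
    """
    prefix = data[:start].decode(encoding)
    clip = data[start:end].decode(encoding)
    # Two bytes per UTF-16 code unit; surrogate pairs count as two units,
    # which matches how C# strings are indexed.
    u16_start = len(prefix.encode("utf-16-le")) // 2
    u16_length = len(clip.encode("utf-16-le")) // 2
    return u16_start, u16_start + u16_length


# Example: the ligature occupies 3 bytes but only 1 UTF-16 code unit.
data = "the \ufb01rst quick fox".encode("utf-8")
byte_start = data.index(b"quick")
byte_end = byte_start + len(b"quick")
print(byte_start, byte_end)                                  # 11 16
print(byte_span_to_utf16_span(data, byte_start, byte_end))   # (9, 14)
```

The alternative would be to do the same conversion on the C# side by reading the file as bytes and decoding only the requested byte range, but I have not tried that.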