I do have a problem due to cups-PDF creating PDF documents where characters are mapped to strange symbols [on Ubuntu Linux 14.04 and 16.04}. I think its some kind of unicode even if Python is telling me its string type. type(object)
python returns "string"
No difference if I grab the text out of the PDF via Mouse copy paste from evince / Firefox or via Python PDFminer module. So its true, the PDF has broken text information which is rendered correct on PDF document itself. I did not know that, but text, and text-graphic on PDF document seem to be no bound very tight together.
When I do copy text from such created PDF document by example the name "Raphael" turns into "✡✍✑✒✍☛✓"
so each single character maps to "✡=R ✍=a ✑=p ✒=h ✍=a ☛=e ✓=l"
Another example is: "Devel"
turns into "✭☛✮☛✓"
How can I write a function in Python which shifts this "wrong" information to the correct one? On the PDF Document everything is perfectly readable.
This has something todo with cups-PDF using postscript to create the PDF but not adding the correct font/character information to the document.
If the letter 'l'
is always the Symbol '✓'
which is this checkmark unicode character
How can I do a remap of the characters in this strange representation to correct representation in Python? So how can I shift or remap symbol '✓'
to letter 'l'
? Any Idea?
Why I need this? I need to search for a text value in this documents.