Hindi content is distorted when copied from PDF file

Question

Whenever I try to copy Hindi content from any resource, the characters are distorted. I tried pasting into a browser, MS Word, text files, etc. I am using Acrobat DC.

For example, in the attached file, when I copy the content of page 3 (in Hindi), the characters are changed.

is changed to संिैधावनक

is changed to ईपचारों

I tried many libraries, the built-in export tools, copy/paste, wizards, and changing the encoding and language settings, but none of them worked. I also wrote a few scripts, installed language packs, and ran OCR after converting the pages to images, but those did not work either.

Can you suggest a way to resolve this issue?

Link to the file: https://www.dropbox.com/s/ujbt7d2aidqg8r4/Vision%20IAS%20Prelims%202019%20Test%201%20%5BHindi%20Medium%5D.pdf?dl=0

KenS · Answer 1 · 2018-11-28T09:53:46.717

For the Stack Overflow rules lawyers: I know this isn't a complete answer, but it's too long for a comment.

As a non-speaker of the language, it's rather difficult for me to identify differences here. There's quite a lot of text, and while I can see that the fonts are different, it's not clear to me that the individual glyphs are. Can you point to one specific glyph that is incorrect after it is copied?

The font embedded in the file (Arial Unicode MS) has an attached ToUnicode CMap which looks correct to me; however, several of the single character codes map to multiple Unicode code points. E.g. character code 0x564 maps to the Unicode values 0x093E, 0x0901.
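To make the one-code-to-many-code-points case concrete, here is a minimal Python sketch that reads `bfchar` entries from a ToUnicode CMap. It assumes the CMap stream has already been extracted and decompressed into plain text; the excerpt below is a hypothetical fragment built from two character codes mentioned in this answer (0564 and 0577), and `parse_bfchar` is an illustrative helper, not part of any PDF library. Real CMaps can also contain `bfrange` sections, which this sketch ignores.

```python
import re

# Hypothetical excerpt of a ToUnicode CMap (already decompressed).
# The two entries use codes quoted in this answer; the layout is illustrative.
CMAP_EXCERPT = """
2 beginbfchar
<0564> <093E0901>
<0577> <0915>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is a UTF-16BE string, so a single character
            # code can legitimately map to several Unicode code points.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

mapping = parse_bfchar(CMAP_EXCERPT)
for code, text in mapping.items():
    print(f"{code:04X} -> {[f'U+{ord(ch):04X}' for ch in text]}")
```

A multi-code-point entry like the first one is not an error in itself; whether it is right depends on what the glyph for that code actually draws.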

I have no way to tell easily if this is correct. I could laboriously decode the entire string, check to see what the Unicode code points are and then try and match those to the characters in the original file by placing them individually in a Word document, using Arial Unicode MS. But it looks to me like an awful lot of the characters are correct, and I don't want to waste a lot of time doing that.

[edit]

So this is what the text in the PDF file looks like. The character code is the actual character code in the PDF file; it maps to a glyph program in the font via the CMap and other parts of the font machinery that we needn't worry about here. It also maps, via the ToUnicode CMap, to a set of Unicode code points:

code            Unicode         glyph   name

059A            0938            स       Sa
0565            0902            ं       sign Anusvara
0597            093F            ि       vowel sign I
05A8            0948            ै       vowel sign Ai
0589            0927            ध       Dha
059E            093E            ा       vowel sign Aa
059F            0935            व       Va
058A            0928            न       Na
0577            0915            क       Ka

Doing my best to recall how to read Devanagari, I believe the original is something like Sa (with 'am' diacritic), Va (with the Ai vowel modifier), Dha (with the Aa vowel modifier), Na (with the I vowel modifier) and finally Ka.

I'm afraid that the reason this doesn't cut and paste properly is simply that the ToUnicode values seem to be partially incorrect. The character code 0x0597 has been assigned the Unicode value U+093F when it should be U+0935, and the character code 0x059F has been assigned the Unicode code point U+0935 when it should be U+093F. That is, the Unicode values of those two character codes have been transposed.
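The transposition can be reproduced with a short Python sketch. It uses only the character codes and ToUnicode values from the table above; `decode` and `swap_entries` are illustrative helper names, and the dictionary stands in for the real CMap, which would have to be edited inside the PDF itself.

```python
# ToUnicode entries exactly as listed in the table (character code -> code point).
TO_UNICODE = {
    0x059A: 0x0938,  # Sa
    0x0565: 0x0902,  # Anusvara
    0x0597: 0x093F,  # listed as vowel sign I
    0x05A8: 0x0948,  # vowel sign Ai
    0x0589: 0x0927,  # Dha
    0x059E: 0x093E,  # vowel sign Aa
    0x059F: 0x0935,  # listed as Va
    0x058A: 0x0928,  # Na
    0x0577: 0x0915,  # Ka
}

# Character codes in the order they appear in the PDF text, per the table.
CODES = [0x059A, 0x0565, 0x0597, 0x05A8, 0x0589, 0x059E, 0x059F, 0x058A, 0x0577]

def decode(codes, table):
    """Map each character code through the table and join the result."""
    return "".join(chr(table[code]) for code in codes)

def swap_entries(table, a, b):
    """Return a copy of the table with the values for codes a and b exchanged."""
    fixed = dict(table)
    fixed[a], fixed[b] = table[b], table[a]
    return fixed

broken = decode(CODES, TO_UNICODE)
fixed = decode(CODES, swap_entries(TO_UNICODE, 0x0597, 0x059F))
print(broken)  # the string reported in the question
print(fixed)   # after exchanging the two transposed entries
```

Note that the swapped result still has the vowel sign I in front of Na, because the PDF lists glyphs in the order they are drawn. Unicode text stores that sign after its consonant, so a further reordering step is probably needed before the pasted text is fully correct.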

When you copy and paste this you end up with incompatible modifiers, which is why you get the funny characters. The dashed ring in the glyph indicates where the character being modified by the accent should be. You should never see this, but because the layout engine can't find a base character to modify it just draws the accent on its own.
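As a rough way to spot such stranded modifiers in pasted text, the Python sketch below flags Devanagari dependent vowel signs (U+093E to U+094C) that do not directly follow a letter. This is a simplified heuristic of my own, not how a layout engine actually decides to draw the dotted circle; it ignores nukta, virama and other legitimate sequences. The "intended" string is reconstructed from the description of the word given earlier in this answer.

```python
import unicodedata

def orphaned_vowel_signs(text):
    """Indices of dependent vowel signs that are not preceded by a letter."""
    orphans = []
    for i, ch in enumerate(text):
        if "\u093e" <= ch <= "\u094c":  # Devanagari dependent vowel signs
            prev = text[i - 1] if i > 0 else ""
            # Category "Lo" covers Devanagari consonants and independent vowels.
            if not prev or unicodedata.category(prev) != "Lo":
                orphans.append(i)
    return orphans

pasted = "\u0938\u0902\u093f\u0948\u0927\u093e\u0935\u0928\u0915"    # string from the question
intended = "\u0938\u0902\u0935\u0948\u0927\u093e\u0928\u093f\u0915"  # Sa+am, Va+Ai, Dha+Aa, Na+I, Ka

print(orphaned_vowel_signs(pasted))
print(orphaned_vowel_signs(intended))
```

In the pasted string, the vowel sign I follows the Anusvara and the vowel sign Ai follows another vowel sign, so neither has a consonant to attach to.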

I'm afraid that your PDF file has been badly made; the only way to fix this would be to correct the errors in the ToUnicode CMap. I did do this for the two characters I noted above, and the text then copies and pastes as:

संवैधाषनक

Which looks more or less correct (I seem to have made an error with one vowel modifier). However, there may be other faults in that table, and it's very much non-trivial to try to correct it. It's taken me the best part of a couple of hours to work out that problem; verifying the entire CMap would take me a day or two. And that CMap is particular to this document. I couldn't use it elsewhere because the font is a subset. A different document would have a different subset, which would mean the character codes would be different.

Your answer rings some bells for me. It should be a duplicate question but I cannot find it (possibly it got deleted, as it wasn't a stellar question anyway – much like this one, actually) — Jongware, Nov 24 '18 at 17:07
