
I have a few PDF files which are in the Urdu language, and some of the PDF files are in the Arabic language.

I want to convert the PDF files to text format. I ran the following Ghostscript command from the command line on my Windows 10 system:

gswin64c.exe -sDEVICE=txtwrite -o output.txt new.pdf

The text file is generated; however, its contents are not in Urdu or Arabic.

This is what it looks like (I have pasted only a portion, as the output is huge):

ی첺جⰧ�� ہ셈ے

How can I properly convert PDF to text using Ghostscript?

Peter Mortensen
Shahid
  • Portions of that look like Arabic, or possibly Urdu (the 1st, 7th and 9th glyphs), though it is hard to be certain. Some of the other glyphs look like Chinese (glyphs 2 and 8, for example), which suggests to me that there is no ToUnicode CMap and the text extraction code is falling back to one of the heuristic methods, which are less reliable. However, as K J says, without looking at the content of the original PDF file it is impossible to say more. There is **no** guaranteed method for extracting text from a PDF file. You could try using the newest Ghostscript pdfwrite with OCR. – KenS Apr 11 '22 at 17:29
  • @KenS I am attaching links to two PDF files, which give me different outputs. [Link to file 1](https://www.mediafire.com/file/lp7udr1y3eur0qe/new.pdf/file) [Link to file 2](https://www.mediafire.com/file/dyp167ucbe4x9ll/toc.pdf/file) I have installed the latest version of Ghostscript. – Shahid Apr 11 '22 at 17:39

1 Answer


Well basically the answer is that the PDF files you have supplied have 'not terribly good' ToUnicode CMap tables.

Looking at your first file we see that it uses one font:

26 0 obj
<<
  /BaseFont /CCJSWK+JameelNooriNastaleeq
  /DescendantFonts 28 0 R
  /Encoding /Identity-H
  /Subtype /Type0
  /ToUnicode 29 0 R
  /Type /Font
>>
endobj

That has a ToUnicode CMap in object 29. The ToUnicode CMap is meant to map character codes to Unicode code points. Looking at the first piece of text as an example we see:

/C2_0 1 Tf
13 0 0 13 39.1302 561.97 Tm
<0003>Tj
/Span<</ActualText<FEFF0645062A>>> BDC 
<38560707>Tj

So that's character code 0x0003 (notice there is no marked content for the first character). Looking at the ToUnicode CMap we see:

<0003> <0020>

So character code 0x0003 maps to Unicode code point U+0020, a space. The next two character codes are 3856 and 0707. Again consulting the ToUnicode CMap we see:

<3856> <062A0645>

So that single character code maps to two Unicode code points, U+062A and U+0645, which are 'Teh' ت and 'Meem' م.

So far so good. The next code is 0707. When we look it up in the ToUnicode CMap, it comes out as U+FFFD, which is the 'replacement character' �. Obviously that's meaningless.

We then have this:
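As a rough illustration of the lookup described above, here is a minimal Python sketch that decodes the two hex strings using only the three ToUnicode entries quoted so far. The names `TO_UNICODE` and `decode_hex_string` are made up for this example, and substituting U+FFFD for a code with no entry is my own simplification, not a description of what Ghostscript does internally.

```python
# ToUnicode entries quoted above for the first file:
# 2-byte character code -> UTF-16BE hex of the Unicode text.
TO_UNICODE = {
    0x0003: "0020",      # space
    0x3856: "062A0645",  # Teh followed by Meem
    0x0707: "FFFD",      # replacement character
}


def decode_hex_string(hex_string, to_unicode):
    """Decode a PDF hex string made of 2-byte codes via a ToUnicode mapping."""
    digits = hex_string.strip("<>")
    pieces = []
    for i in range(0, len(digits), 4):
        code = int(digits[i:i + 4], 16)
        mapped = to_unicode.get(code)
        if mapped is None:
            # Assumption for this sketch: unknown codes become U+FFFD.
            pieces.append("\ufffd")
        else:
            pieces.append(bytes.fromhex(mapped).decode("utf-16-be"))
    return "".join(pieces)


first = decode_hex_string("<0003>", TO_UNICODE)
second = decode_hex_string("<38560707>", TO_UNICODE)
print([f"U+{ord(ch):04X}" for ch in first + second])
```

The second string yields Teh, Meem and then the replacement character, which is exactly the kind of junk that ends up in the extracted text.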

0.391 0 Td
[<011C07071FEE>1 <0003>243.8 <2E93>]TJ
/Span<</ActualText<FEFF0644>>> BDC 
<0707>Tj
EMC 

So that's character codes 0x011C, 0x0707, 0x1FEE, 0x0003, 0x2E93 followed by 0x0707. Notice that the final <0707> is associated with a Marked Content definition which says the ActualText is U+0644, which is the 'Lam' glyph ل

So clearly the ToUnicode CMap should be mapping 0707 to U+0644, and it doesn't.
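The mismatch can be checked mechanically. The sketch below (the helper name `actual_text_to_str` is hypothetical) decodes the `/ActualText` hex string, which starts with the UTF-16BE byte order mark FEFF, and compares it with what the ToUnicode CMap returns for `<0707>`.

```python
def actual_text_to_str(hex_string):
    """Decode an /ActualText hex string that begins with a UTF-16BE BOM (FEFF)."""
    raw = bytes.fromhex(hex_string.strip("<>"))
    if raw[:2] != b"\xfe\xff":
        raise ValueError("expected a UTF-16BE byte order mark")
    return raw[2:].decode("utf-16-be")


# Value taken from the marked-content span around the final <0707>.
actual = actual_text_to_str("<FEFF0644>")
# Value the ToUnicode CMap gives for code 0707 in this file.
mapped = bytes.fromhex("FFFD").decode("utf-16-be")

print(f"ActualText: U+{ord(actual):04X}, ToUnicode: U+{ord(mapped):04X}")
print("agree" if actual == mapped else "disagree")
```

For this span the two sources disagree: the ActualText says Lam, while the CMap says "unknown".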

Now when given a ToUnicode CMap the text extraction code trusts it. So your problem with this file is that the ToUnicode CMap is 'wrong', and that's why the text is coming out incorrect.

I haven't tried to debug further through the file; it is possible there are other errors.

Your second file has this ToUnicode CMap:

26 0 obj
<<
  /Length 606
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AABACF+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /AABACF+TT1+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0707> endcodespacerange
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

The first text in the file is:

<3718>Tj

And again, that's not in the CMap. Because the text extraction code prioritises the CMap (because it's usually reliable), the missing entries cause the extraction to basically fail.
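To make the missing entry concrete, here is a small Python sketch that parses the `bfchar` and `bfrange` sections of the CMap shown above and then looks up 0x3718. It is deliberately simplified: `parse_to_unicode` is a hypothetical helper, it assumes every `bfrange` destination is a single BMP code point, and it ignores the array form of `bfrange` that the PDF specification also allows.

```python
import re

# bfchar / bfrange sections copied from the second file's ToUnicode CMap.
CMAP = """
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Return {character code: Unicode string} for bfchar and simple bfrange entries."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)  # assumes one BMP code point per destination
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


mapping = parse_to_unicode(CMAP)
print(len(mapping), "codes mapped")
print("0x3718 mapped:", 0x3718 in mapping)
```

Note also that 0x3718 lies outside the `<0003> <0707>` range declared by `begincodespacerange`, so the CMap does not even claim to cover it.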

In addition to the fact that the ToUnicode CMaps are incorrect, the embedded fonts are subset and use an Identity-H CMap for drawing. That eliminates another source of information we could use.

Fundamentally the only way you're going to get text out of that PDF file is manual transcription or OCR software.

Since you are using Ghostscript on Windows, the distributed binary includes Tesseract, so you could try using that with pdfwrite and an Urdu training file to produce a PDF file with a possibly better ToUnicode CMap. You could then extract the text from that PDF file.

You would have to tell the pdfwrite device not to use the embedded ToUnicode CMaps; see the UseOCR switch documented here: https://ghostscript.com/doc/9.56.1/VectorDevices.htm#PDF

Information on setting up the OCR engine and getting output is here: https://ghostscript.com/doc/9.56.1/Devices.htm#OCR-Devices

You may get better results by using an 'image' OCR output and then using the text extraction on that file to get the text out.
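A rough sketch of that two-step workflow, written as Python that only builds and prints the command lines. The switch spellings `-sUseOCR=Always` and `-sOCRLanguage=urd` are what I would expect from the documentation linked above, but treat them as assumptions and check them against your Ghostscript version. `urd` is the name of Tesseract's Urdu training data (`urd.traineddata`), which has to be somewhere Ghostscript can find it. The helper names and the intermediate file name `ocr.pdf` are made up for this example.

```python
def build_ocr_pdf_command(src_pdf, dst_pdf, language="urd", gs="gswin64c.exe"):
    """Command to rewrite a PDF with pdfwrite, using OCR for the Unicode values.

    Assumed switches: -sUseOCR=Always and -sOCRLanguage=<tesseract language>.
    """
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-sUseOCR=Always",
        f"-sOCRLanguage={language}",
        "-o", dst_pdf,
        src_pdf,
    ]


def build_txtwrite_command(src_pdf, dst_txt, gs="gswin64c.exe"):
    """Command to extract text with txtwrite, as in the question."""
    return [gs, "-sDEVICE=txtwrite", "-o", dst_txt, src_pdf]


steps = [
    build_ocr_pdf_command("new.pdf", "ocr.pdf"),
    build_txtwrite_command("ocr.pdf", "output.txt"),
]
for step in steps:
    # To actually run a step: subprocess.run(step, check=True)
    print(" ".join(step))
```

The commands are only printed here so nothing runs by accident; paste them into a command prompt, or pass each list to `subprocess.run`, once the switches have been confirmed.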

KenS
  • Thank you so much for your effort. I never thought that text conversion would be so technical. I will follow the steps you have mentioned. If I have understood properly, the summary of the solution is that I have to generate another PDF using Ghostscript and then get the text out of that PDF. – Shahid Apr 12 '22 at 14:04
  • Essentially, yes. Or of course you could just use Tesseract or another OCR solution directly on the original PDF file. Or render the PDF to an image (e.g. TIFF) and then run Tesseract over that. – KenS Apr 12 '22 at 14:52