
I use PDFNet (version 9.308007) to convert PDF files to text. I recently had to upgrade from Ubuntu 16.04 to Ubuntu 20.04. The problem is that the word order in the output files changes when I convert with PDFNet on Ubuntu 20.04. For example:

Ubuntu 16.04

'\r\n -$14,309.29\r\n Payment - 12/19/2022 - Thank You;

Ubuntu 20.04

'Payment - 12/19/2022 - Thank You -$14,309.29\r\n'

I need the word order to be exactly as in the first variant (Ubuntu 16.04). I would be very grateful for any hints on where to dig further.

dayz1
  • Welcome to StackOverflow. Please provide the input PDF and the command / procedure used to convert it, in order to narrow down the problem – Caridorc Aug 07 '23 at 16:06
  • If the PDFNet SDK is identical (9.308007) on both systems, then most likely the issue is font substitution. You can update the fonts on Ubuntu 20 so that they are exactly the same as the fonts on Ubuntu 16, though it would be best to first establish whether the PDF(s) in question have non-embedded fonts. Can you provide an example PDF with the issue (and a screenshot clearly showing which page and where on the page the issue occurs)? – Ryan Aug 08 '23 at 22:16
  • @Ryan, thanks a lot for your help! As you advised, I replaced the fonts with the ones from Ubuntu 16 and disabled all extra fonts that weren't present on 16, and everything works now. If you want, add your answer and I will mark it as the right one. Have a nice day! – dayz1 Aug 09 '23 at 09:19

1 Answer


Assuming not all fonts in the PDF are embedded, the issue is that different fonts are installed on the two systems, and when PDFNet does font substitution (for the non-embedded fonts), the substituted fonts have different metrics and glyphs. These subtle differences in font metrics and glyphs can affect text run detection and result in different text extraction output.
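To check the "not all fonts are embedded" assumption, one option outside PDFNet is the `pdffonts` tool from poppler-utils, which prints an `emb` column per font. The sketch below is only an illustration: it assumes `pdffonts` is installed and that its table ends each row with the `emb sub uni object ID` columns; the helper names `non_embedded_fonts` and `check_pdf` are made up for this example.

```python
from __future__ import annotations

import subprocess


def non_embedded_fonts(pdffonts_output: str) -> list[str]:
    """Return names of fonts whose 'emb' column is 'no'.

    Assumes the usual pdffonts layout: a header line, a dashed
    separator, then one row per font ending in
    'emb sub uni <object number> <generation>'.
    """
    names = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header + dashes
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a font row
        # Count from the right, because the 'type' column may contain spaces.
        if tokens[-5] == "no":
            names.append(tokens[0])
    return names


def check_pdf(path: str) -> list[str]:
    """Run pdffonts on a PDF (hypothetical path) and list non-embedded fonts."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return non_embedded_fonts(result.stdout)
```

If `check_pdf("statement.pdf")` (the file name is a placeholder) returns an empty list, every font is embedded and font substitution is a less likely explanation.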

Update the Ubuntu 20 system to have the same fonts as the Ubuntu 16 system and this should result in the same font substitutions and therefore same text extraction ordering.
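To see which fonts differ between the two machines, the output of `fc-list : family` can be saved on each system and compared. This is a rough sketch, assuming the fonts available to PDFNet for substitution are the ones fontconfig reports (not verified here); the function names are hypothetical.

```python
from __future__ import annotations

import subprocess


def parse_families(fc_list_output: str) -> set[str]:
    """Parse `fc-list : family` output into a set of family names."""
    families = set()
    for line in fc_list_output.splitlines():
        # One line may hold several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return families


def diff_families(old: set[str], new: set[str]) -> tuple[list[str], list[str]]:
    """Return (families only on the old system, families only on the new one)."""
    return sorted(old - new), sorted(new - old)


def installed_families() -> set[str]:
    """Query fontconfig on the current machine (requires fc-list)."""
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return parse_families(result.stdout)
```

Families in the first list are candidates to install on Ubuntu 20.04; families in the second list are candidates to disable there, which matches what resolved the problem in the comments.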

Ryan