0

I am trying to convert a PDF file to an image format (ideally PNG), but some of the table lines do not render in the output, which is an issue since the purpose of my conversion is to use computer vision on it.

I unfortunately do not have access to the file used to generate the PDF.

Thank you in advance for your help!

Attached is the ghostscript rendering vs the actual pdf:

Original GhostScript

EDIT: Thanks for the answers. Here is what I had already tried:- ---

  • Changing the scaling & Changing the Antialiasing (I doubt that any combination of this will work in Ghostscript at this point)

  • Converting to PostScript and then to PNG/PDF

  • Saving from a Browser

  • Saving from various virtual printers to PDF

  • Using Poppler to do the rendering

All to no avail. Digging deeper, I found some interesting things which may be helpfull. Ghostscript does recognize the lines when using -sDevice=X11 and -sDevice=PS2Write (apologies for coding typos). That is, using Ghostscript to visualize the PDF does work, but not to process them into anything else than Postscript.

Also, printing into a PDF from Adobe Acrobat does fix my problem, however this is something that I need to be able to do from the command line on thousands of files.

Hope this helps!

EDIT2:

Link to a concerned file

https://transfer.sh/PuIF90/e176ad9824ddc6cb5e6aead2d389c131-filer.pdf

Sh4yce
  • 9
  • 3
  • A line width of 0 means '1 pixel wide' so it should never disappear. However, a very narrow line width rendered at low resolution **might** disappear if the line when rendered to the on the device failed to touch the centre of the pixel. Determining if that is the case might be tricky and you don't say what command line you have used. You could try; rendering at a higher resolution or anti-aliasing using either -dDownScaleFactor or -dGraphicsAlphaBits (prefer DownScaleFactor) See https://ghostscript.com/doc/current/Devices.htm#PNG – KenS Jul 21 '22 at 18:05
  • The reason for image route is that this is on a project to use camelot python to extract tabular data from the PDF. That said, camelot uses CV to find the lines of the table and reconstruct the matrix (grid?) of the original table, and using the text afterward to parse the content of the individual cells. As you can imagine, Camelot is merging a few columns together since a few lines are missing. Let me know if you have any idea of what direction to look into. – Sh4yce Jul 22 '22 at 00:39
  • I thought that I would share the fix that I found. Turns out that a bunch of the pdf we need to process were generated using a specific HTML5 to PDF conversion tool which turns each lines of the PDF into a rectangle with size 0. Solution for me has been to automate decompressing PDFs, and looking through the text file for "A A A A re", with all "A's" being numbers. Should the last or next to last A be a zero, I change it to size 1. Hope this helps someone else out there and let me know if there is a more elegant way of doing this – Sh4yce Jul 27 '22 at 20:50

1 Answers1

0

I thought that I would share the fix that I found. Turns out that a bunch of the pdf we need to process were generated using a specific HTML5 to PDF conversion tool which turns each lines of the PDF into a rectangle with size 0. Solution for me has been to automate decompressing PDFs, and looking through the text file for "A A A A re", with all "A's" being numbers. Should the last or next to last A be a zero, I change it to size 1.

For instance (once again, after decompressing the PDF):

1000 2000 0 14 re to 1000 2000 1 14 re

Hope this helps someone else out there and let me know if there is a more elegant way of doing this, I am still a beginner about all things PDF.

Sh4yce
  • 9
  • 3