
I have tried several Python libraries, including pdftotree, pdfminer, and tabula, but none of them gives me the result I want. I can extract text, images, and tabular data from a PDF as HTML, but the output does not preserve the layout and organization of the original PDF file. Can someone help me with this? I would be thankful.

  • The answer is NO, because PDF does not contain the HTML tags that were present in the source HTML. – Błotosmętek Feb 18 '20 at 14:06
  • Google docs does it, gmail does it, it's entirely possible. PDF->SVG is not rocket science. – Mark Storer Feb 18 '20 at 15:41
  • What is in these PDFs? Scanned pages? Fancy PDF forms? – Mark Storer Feb 18 '20 at 15:52
  • Why is converting to HTML important for you? What do you do with the HTML afterwards? There may be a much better/easier way to accomplish your task, but you would need to elaborate. – Ryan Feb 26 '20 at 16:50
  • @mark-storer "Google docs does it, gmail does it, it's entirely possible. PDF->SVG is not rocket science" Google Docs does a terrible job. If you upload a PDF to GDrive and open it with Google Docs, it makes a mess of the PDF (such as dropping all images). Unless you are talking about something else, which I would be very curious to know about. – Ryan Feb 26 '20 at 16:59

1 Answer


Mostly yes. Translate the PDF to SVG, and embed the SVG in your web page.

SVG's image model (what it can represent and how) is a near-superset of the PDF image model (which is itself a superset of PostScript's), though SVG lacks some of the print-specific features of PDF. There are quite a few PDF->SVG converters out there already; Googling "PDF to SVG" turned up a number of promising hits.
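As one possible way to drive such a converter from Python, here is a minimal sketch. It assumes the `pdf2svg` command-line tool is installed and on your `PATH`; that tool is my pick for illustration, not the only option. The helper names (`build_pdf2svg_command`, `page_number`, `convert_pdf_to_svgs`) and the `-page%d.svg` naming scheme are made up for this example.

```python
import re
import subprocess
from pathlib import Path


def build_pdf2svg_command(pdf_path, out_dir):
    """Return the argv list that asks pdf2svg to write one SVG per page."""
    # With "all" as the last argument, pdf2svg expects a %d in the output
    # name and replaces it with the page number.
    pattern = Path(out_dir) / (Path(pdf_path).stem + "-page%d.svg")
    return ["pdf2svg", str(pdf_path), str(pattern), "all"]


def page_number(svg_path):
    """Extract the page number from a name like 'report-page12.svg'."""
    match = re.search(r"-page(\d+)\.svg$", Path(svg_path).name)
    return int(match.group(1)) if match else -1


def convert_pdf_to_svgs(pdf_path, out_dir):
    """Run pdf2svg (assumed installed) and return SVG paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdf2svg_command(pdf_path, out_dir), check=True)
    svgs = out_dir.glob(Path(pdf_path).stem + "-page*.svg")
    # Sort numerically so page 10 comes after page 9, not after page 1.
    return sorted(svgs, key=page_number)
```

Calling `convert_pdf_to_svgs("report.pdf", "out")` should leave one SVG per page in `out/`. How faithful the SVGs are depends entirely on the converter you choose.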

There will be some complications:

  • Many PDF files are longer than one page. You might need to generate 10 SVG files for a single 10-page PDF file, and then build a web page around those 10 SVGs. Throw in some dynamic HTML to "turn pages" and you've got a good web-based PDF viewer.

  • There are parts of PDF that aren't within its image model at all... bookmarks, annotations (form fields, digital signatures), document metadata (author, creation date, etc), and so forth. Some of the non-image-model stuff is common enough that a PDF to SVG utility might handle it directly (links), while other stuff doesn't have an HTML equivalent and would be lost.

You could preserve the appearance of a digital signature, but the actual security represented by those visuals would be gone. Preserving that signature's appearance could be considered lying about the security.
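To make the "build a web page around those SVGs" idea concrete, here is a rough sketch of a page-turning wrapper. It assumes you already have one SVG file per page; the function name `build_viewer_html` and the markup are my own invention for illustration. It embeds each page with `<img>`, which keeps things simple but means links inside the SVG will not be clickable.

```python
import html
from string import Template

# Placeholders use $name so the braces in the CSS/JS need no escaping.
_VIEWER = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>$title</title>
<style>img { max-width: 100%; border: 1px solid #ccc; }</style>
</head>
<body>
<button id="prev">Previous</button>
<span id="counter"></span>
<button id="next">Next</button>
<div>
$pages
</div>
<script>
const pages = document.querySelectorAll("img");
let current = 0;
function show(index) {
  if (index < 0 || index >= pages.length) return;
  pages[current].hidden = true;
  current = index;
  pages[current].hidden = false;
  document.getElementById("counter").textContent =
    (current + 1) + " / " + pages.length;
}
document.getElementById("prev").onclick = () => show(current - 1);
document.getElementById("next").onclick = () => show(current + 1);
show(0);
</script>
</body>
</html>
""")


def build_viewer_html(svg_paths, title="PDF viewer"):
    """Return an HTML page that shows one SVG page at a time."""
    if not svg_paths:
        raise ValueError("need at least one SVG page")
    tags = []
    for number, path in enumerate(svg_paths, start=1):
        hidden = "" if number == 1 else " hidden"  # only page 1 starts visible
        src = html.escape(str(path), quote=True)
        tags.append(f'<img class="page" src="{src}" alt="Page {number}"{hidden}>')
    return _VIEWER.substitute(title=html.escape(title), pages="\n".join(tags))
```

Write the returned string to an `.html` file next to the SVGs and open it in a browser. Bookmarks, form fields, and signatures are not carried over by this wrapper, for the reasons above.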

Mark Storer
  • I think OP actually wants a HTML file that has actual `` s and so on (though it isn't obvious from the post), which is a hairier problem altogether.
    – AKX Feb 18 '20 at 15:47
  • I appreciate the help, SVG works fine for me. I was trying to convert the PDF to different formats like `xml` and `json` and then to `html`, and that didn't work out properly. I hadn't thought of converting it to SVG. So, thanks for the help. – Ahsan Masood Feb 20 '20 at 07:54
  • @AKX yes, that is what I wanted, but that is not gonna work, unless I use some kinda neural networks to recreate all the PDF file along with generating html-tags for the page. – Ahsan Masood Feb 20 '20 at 07:55
  • "I believe SVG's image model (what it can represent and how) is a superset of the PDF image model" Unfortunately, SVG does not cover everything that PDF does. Some examples: PDF supports a number of blend modes that SVG does not, as well as CMYK and overprint simulation. The only workaround for these is to statically convert the unsupported content to an image and put that in the SVG. – Ryan Feb 26 '20 at 16:48
  • You can convert from (your colorspace here) to RGB, it's just not easy to do well. If you care at all about color accuracy, then yes, @Ryan is absolutely correct. If not, a "ballpark" CMYK->RGB converter isn't all that hard, even if it's just sampling a bunch of values through someone else's CMYK->RGB and LERPing between them. (LERP: Linear intERPolation... "LERP" is more fun to say than "LININT" or "LI" or whatever. SLERP, Spherical Linear intERPolation, is even more entertaining to say aloud.) – Mark Storer Feb 26 '20 at 20:00
  • It'd be interesting to see what some of the different PDF->SVG converters do with a spot color. I'm guessing lots of "choke and die", with a sprinkling of "translate to RGB with varying degrees of accuracy", and a couple "nailed it" (Adobe, maybe a couple other commercial apps). – Mark Storer Feb 26 '20 at 20:07
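For the "ballpark" CMYK->RGB idea from the comments, here is a small sketch of sampling a reference converter and LERPing between the samples. Everything in it is an assumption for illustration: `naive_cmyk_to_rgb` is the common profile-free approximation standing in for "someone else's CMYK->RGB", and the table only samples the 16 corners of the CMYK cube. A real table would sample a color-managed converter on a finer grid, and none of this is color-accurate.

```python
from itertools import product


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def naive_cmyk_to_rgb(c, m, y, k):
    """Stand-in reference converter: no ICC profile, inputs in 0..1."""
    return tuple(255.0 * (1.0 - ink) * (1.0 - k) for ink in (c, m, y))


def build_corner_table(reference):
    """Sample the reference converter at the 16 corners of the CMYK cube."""
    return {corner: reference(*corner) for corner in product((0, 1), repeat=4)}


def interpolate_cmyk(table, c, m, y, k):
    """Multilinear interpolation over the corner table, one axis at a time."""
    values = table
    # Collapse the last remaining axis on each pass: k, then y, then m, then c.
    for t in (k, y, m, c):
        values = {
            key[:-1]: tuple(
                lerp(lo, hi, t)
                for lo, hi in zip(values[key[:-1] + (0,)], values[key[:-1] + (1,)])
            )
            for key in values
            if key[-1] == 0
        }
    # After four passes only the empty key is left, holding the RGB triple.
    return tuple(round(channel) for channel in values[()])


table = build_corner_table(naive_cmyk_to_rgb)
print(interpolate_cmyk(table, 0.0, 1.0, 1.0, 0.0))  # pure magenta + yellow
```

Spot colors, overprint, and blend modes are a separate problem; as noted above, the usual fallback there is to rasterize the unsupported content.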