
I'm trying to extract text from PDF files by converting them to HTML with the Adobe Acrobat SDK and Python, since Acrobat is the only tool I've found that reproduces the actual structure of the PDF. Most files convert fine, but in some files one or two paragraphs somehow get left out, even though the same paragraphs look perfect in the PDF. I'd be grateful if someone could shed some light on this.

My Python code to convert:

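from win32com.client import Dispatch

src = 'location to pdf file'  # absolute path to the source PDF
filename = src  # assumed output base name, so 20.pdf becomes 20.pdf.html
AvDoc = Dispatch("AcroExch.AVDoc")
if AvDoc.Open(src, ""):
    pdDoc = AvDoc.GetPDDoc()
    jsObject = pdDoc.GetJSObject()
    # Export through Acrobat's JavaScript bridge using the HTML conversion ID
    jsObject.SaveAs(filename + ".html", "com.adobe.acrobat.html")
    AvDoc.Close(True)  # close the document without saving changes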

Sample PDF file:

20.pdf

Respective HTML file:

20.pdf.html

It doesn't happen in all PDFs. If you think an empty signature widget might be the cause, note that all of the PDFs have one.

If you look at point '2.' in the HTML, it is completely collapsed and sits outside the 'ol' tag, which holds the other points in a perfect structure.
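To narrow down which converted files are affected, a small check along these lines might help. It is only a diagnostic sketch, not a fix: it assumes the misplaced item shows up as text beginning with a number and a period (such as "2.") outside every 'ol' element, and the names StrayListItemFinder and find_stray_items are made up for illustration.

```python
from html.parser import HTMLParser
import re


class StrayListItemFinder(HTMLParser):
    """Collect text chunks that look like numbered items but sit outside any <ol>."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0   # how many <ol> elements are currently open
        self.stray = []     # numbered-looking text found outside every <ol>

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth > 0:
            self.ol_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Assumption: a stray item starts with digits, a period, then whitespace or nothing
        if self.ol_depth == 0 and re.match(r"\d+\.(\s|$)", text):
            self.stray.append(text)


def find_stray_items(html):
    """Return the numbered-looking text chunks that are not inside an <ol>."""
    finder = StrayListItemFinder()
    finder.feed(html)
    finder.close()
    return finder.stray
```

Running find_stray_items over the contents of each exported .html file would list the suspicious items per file, which could then be compared against the tag structure of the source PDFs.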

Please help.


0 Answers