I need to convert PDF files to HTML.

I can do this manually via several steps, using this (Rube) Goldberg variation:

0) Save PDF as text
1) Copy-and-paste text into MS Word
2) Save MS Word doc as HTML

I feel like I'm walking on my hands doing that, though.

Is there a programmatic way to accomplish the same, so that I could do something like:

string htmlFile = ConvertPDFToHTML("FrumiousBandersnatch.PDF");
B. Clay Shannon-B. Crow Raven
  • Do you only want the text from the PDF file, or are you looking to preserve the layout, images, vector content, typesetting, formatting, etc.? Your current method seems only interested in the raw text... – J... Feb 13 '14 at 18:59
  • Yes, all I want is the HTML with paragraph tags intact, so I can stuff paragraph content into a generic List of String. Formatting and images, etc. will be ignored. – B. Clay Shannon-B. Crow Raven Feb 13 '14 at 19:40
  • What extras do you hope to add by copying plain text into MS Word? It seems to me that if you only want the `<p>` tags added, you don't need that step. Adding `<p>`s around lines may be simpler with a utility such as `sed`. You could try [`pdftotext`](http://en.wikipedia.org/wiki/Pdftotext) for the first step. – Jongware Feb 13 '14 at 21:24
  • All I want is the text decorated with p tags. I'll czech out your link, thanks! – B. Clay Shannon-B. Crow Raven Feb 13 '14 at 21:33
  • In general, I will be working with HTML files which already are HTML files, but in some cases I have to first convert something else (specifically, PDF) to HTML, thus my question. – B. Clay Shannon-B. Crow Raven Feb 13 '14 at 21:39
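
The approach suggested in the comments (extract plain text with `pdftotext`, then wrap each paragraph in `<p>` tags) could be sketched roughly as follows. This is a minimal sketch in Python, not the question's own language, and it rests on several assumptions: the `pdftotext` command-line tool is installed and on the PATH, paragraphs in its output are separated by blank lines, and the function names (`text_to_paragraphs`, `paragraphs_to_html`, `convert_pdf_to_html`) are hypothetical, chosen only to mirror the call in the question.

```python
import html
import re
import subprocess


def text_to_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs at blank lines (an assumed convention)."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    blocks = re.split(r"\n\s*\n", text)
    # Collapse the wrapped lines of each block into a single line.
    paragraphs = [" ".join(block.split()) for block in blocks]
    # Drop blocks that contained only whitespace.
    return [p for p in paragraphs if p]


def paragraphs_to_html(paragraphs: list[str]) -> str:
    """Wrap each paragraph in <p> tags, escaping HTML special characters."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


def convert_pdf_to_html(pdf_path: str) -> str:
    """Hypothetical helper: run pdftotext, then decorate the text with <p> tags."""
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return paragraphs_to_html(text_to_paragraphs(result.stdout))
```

Since the stated goal is a list of paragraph strings, `text_to_paragraphs` alone may already be enough, and the HTML step could be skipped entirely. How well blank lines line up with real paragraph boundaries depends on the PDF, so the splitting rule may need adjusting per document.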

0 Answers