1

I want to use IcePDF or PDFBox to extract content from a PDF, but I don't know how to go on to generate HTML web pages from the extracted text and images.

tshepang
Tayba

1 Answer

2

You can convert PDF to HTML with PDFBox. Try this link.

If you pass -html as a parameter to PDFBox's ExtractText command-line tool, you get the PDF's text as HTML. The output will not contain images, graphics, or other details; it is only the text extracted from the PDF, in HTML format.

If you want to reproduce the exact look and feel of the PDF, there is no single-step method in PDFBox. To my knowledge, no library can create an exact HTML version of a PDF. With PDFBox, however, you can extract the images, the text, and the details of each, and from those details you have to write your own logic to produce the HTML. We did a project converting PDF to HTML for azzist.com and accomplished the conversion using PDFBox. In azzist we convert résumés to HTML format (some font issues still remain).
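As a rough illustration of the "own logic" step, here is a minimal sketch that turns already-extracted text runs and images into absolutely positioned HTML. It does not call PDFBox at all: the TextRun and ImageRef classes and the render method are hypothetical names made up for this example, and it assumes you have already collected coordinates (in points, measured from the top-left of the page), for instance from the TextPosition objects that PDFBox hands to a PDFTextStripper subclass. Font handling, which is where most of the real difficulty lies, is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PositionedHtml {

    /** Hypothetical holder for one run of extracted text (points, origin at top-left of page). */
    public static final class TextRun {
        final float x, y, fontSize;
        final String text;

        public TextRun(float x, float y, float fontSize, String text) {
            this.x = x; this.y = y; this.fontSize = fontSize; this.text = text;
        }
    }

    /** Hypothetical holder for one extracted image that has already been saved to a file. */
    public static final class ImageRef {
        final float x, y, width, height;
        final String src;

        public ImageRef(float x, float y, float width, float height, String src) {
            this.x = x; this.y = y; this.width = width; this.height = height; this.src = src;
        }
    }

    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one page as a relatively positioned div containing absolutely positioned children. */
    public static String render(List<TextRun> runs, List<ImageRef> images,
                                float pageWidth, float pageHeight) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        // Images go first so that text is drawn on top of them.
        for (ImageRef img : images) {
            html.append(String.format(Locale.ROOT,
                    "  <img src=\"%s\" style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "width:%.1fpt;height:%.1fpt\">\n",
                    escape(img.src), img.x, img.y, img.width, img.height));
        }
        for (TextRun run : runs) {
            html.append(String.format(Locale.ROOT,
                    "  <span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt\">%s</span>\n",
                    run.x, run.y, run.fontSize, escape(run.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for what an extraction pass would produce.
        List<TextRun> runs = Arrays.asList(new TextRun(72f, 100f, 12f, "Curriculum Vitae"));
        List<ImageRef> images = Arrays.asList(new ImageRef(400f, 72f, 96f, 96f, "photo.png"));
        System.out.print(render(runs, images, 612f, 792f));
    }
}
```

Absolute positioning is the simplest way to keep the page layout, at the cost of HTML that does not reflow; merging runs into lines and paragraphs would be a further step on top of this.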

Scribd, Google, Dropbox, Zoho, and others have accomplished this conversion in a better way. You can look at any of these sites to see how well it can be done. (You will not get their logic, though; you have to work it out yourself.)

Tilman Hausherr
Neeraj