I want to use IcePDF or PDFBox to extract content from a PDF, but I don't know how to go on from there and generate HTML web pages from the extracted text and images.
-
What do you want to extract from which input? – mkl Dec 24 '12 at 23:58
-
I need to transform all PDF pages into HTML web pages with all of their data (text, images, grids, ...). – Tayba Dec 26 '12 at 08:40
1 Answer
You can convert PDF to HTML with PDFBox. Try this link.
If you add -html as a parameter when you extract text, you get the PDF's text as HTML. The output will not contain any images, graphics, or other details; it is only the text extracted from the PDF, in HTML format.
If you want to reproduce the exact look and feel of the PDF, there is no single-step method in PDFBox. As far as I know, no library can produce an exact HTML copy of a PDF. With PDFBox, however, you can extract the images, the text, and the details of each (such as position and font). Using those details, you have to write your own logic to produce the HTML. We did a PDF-to-HTML conversion project for azzist.com and accomplished the conversion with PDFBox; in azzist we convert résumés to HTML. (There are still some font issues.)
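To give an idea of what "write your own logic" can look like, here is a minimal sketch, not our actual azzist code. It assumes you have already pulled text runs and images out of the PDF and converted their coordinates to points measured from the top-left corner of the page. `TextRun`, `ImageRef`, and `renderPage` are hypothetical names made up for this example; they are not PDFBox classes. With PDFBox you would probably fill them from a `PDFTextStripper` subclass that looks at the `TextPosition` objects, but that part is left out here.

```java
import java.util.List;
import java.util.Locale;

public class PositionedHtmlSketch {

    /** Hypothetical holder for one extracted text run (not a PDFBox class). */
    public static final class TextRun {
        final String text;
        final double x, y, fontSize; // points, measured from the page's top-left corner

        public TextRun(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Hypothetical holder for one extracted image that was already saved to a file. */
    public static final class ImageRef {
        final String src;
        final double x, y, width, height; // points, top-left origin

        public ImageRef(String src, double x, double y, double width, double height) {
            this.src = src;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders one page as a relatively positioned div with absolutely positioned children. */
    public static String renderPage(double pageWidth, double pageHeight,
                                    List<TextRun> runs, List<ImageRef> images) {
        StringBuilder html = new StringBuilder();
        html.append(String.format(Locale.ROOT,
                "<div class=\"page\" style=\"position:relative;width:%.1fpt;height:%.1fpt\">\n",
                pageWidth, pageHeight));
        // Images are written first so that the text is drawn on top of them.
        for (ImageRef img : images) {
            html.append(String.format(Locale.ROOT,
                    "  <img src=\"%s\" style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "width:%.1fpt;height:%.1fpt\">\n",
                    escape(img.src), img.x, img.y, img.width, img.height));
        }
        for (TextRun run : runs) {
            html.append(String.format(Locale.ROOT,
                    "  <span style=\"position:absolute;left:%.1fpt;top:%.1fpt;"
                            + "font-size:%.1fpt;white-space:pre\">%s</span>\n",
                    run.x, run.y, run.fontSize, escape(run.text)));
        }
        html.append("</div>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Sample data standing in for values extracted from a US Letter page.
        List<TextRun> runs = List.of(
                new TextRun("Curriculum Vitae", 72, 80, 18),
                new TextRun("Skills: Java & PDF", 72, 110, 11));
        List<ImageRef> images = List.of(new ImageRef("page1-img0.png", 400, 60, 120, 120));
        System.out.print(renderPage(612, 792, runs, images));
    }
}
```

Absolute positioning keeps the layout close to the PDF, but the result is hard to reflow or edit. Fonts are not handled at all here, which is exactly where we still have issues.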
Scribd, Google, Dropbox, Zoho, and others have accomplished this conversion in a better way. You can look at any of these sites to see the results they achieve. (You will not get their logic; you have to work that out yourself.)

-
@chinna_82 I fixed the link. (Hopefully you didn't wait that long :-)) – Tilman Hausherr Sep 09 '15 at 19:35