Questions tagged [pdf2htmlex]

pdf2htmlEX renders PDF files in HTML, utilizing modern Web technologies. It aims to provide an accurate rendering while staying optimized for Web display.

pdf2htmlEX is best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML such that you can still search and copy. Math formulas, figures and images are also supported. The generated HTML file is static, with optional features powered by JavaScript.

pdf2htmlEX is also a publishing tool: almost 50 options make it flexible for many different use cases, such as PDF preview, book/magazine publishing, and personal resumes.
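
As a rough illustration of how a few of these options combine, here is a minimal Python sketch that only assembles a command line. The helper name `build_pdf2htmlex_command`, its parameters, and the file names are hypothetical; the flags themselves (`--dest-dir`, `--zoom`, `--split-pages`, `--page-filename`, `--embed-image`) are the ones that appear in the questions listed on this page. The sketch does not run the tool, so it makes no claim about the resulting output.

```python
import shlex


def build_pdf2htmlex_command(pdf_path, dest_dir=".", zoom=None,
                             split_pages=False, page_filename=None,
                             embed_image=True):
    """Assemble an argv list for pdf2htmlEX (hypothetical helper, not part of the tool)."""
    cmd = ["pdf2htmlEX", "--dest-dir", dest_dir]
    if zoom is not None:
        cmd += ["--zoom", str(zoom)]
    if split_pages:
        cmd += ["--split-pages", "1"]
        if page_filename:
            cmd += ["--page-filename", page_filename]
    if not embed_image:
        # Keep images as separate files instead of embedding them in the HTML.
        cmd += ["--embed-image", "0"]
    cmd.append(pdf_path)
    return cmd


# Example with made-up file names; print the command instead of executing it.
cmd = build_pdf2htmlex_command("paper.pdf", dest_dir="out", zoom=1.4,
                               embed_image=False)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, assuming the `pdf2htmlEX` binary is on the `PATH`.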

31 questions
1
vote
0 answers

How to get sticky notes attached to pdf documents while using pdf2htmlEx tool?

I used the option --process-annotation 1 to view annotations in PDF documents. This works fine for highlight, underline, strikethrough, and rectangular box annotations, but not for notes added as sticky notes: the converted HTML contains only the note icon - missing…
Tom Taylor
  • 3,344
  • 2
  • 38
  • 63
1
vote
0 answers

Extract all content from PDF file (not just text, but also tables/diagrams)?

I'd like to reformat PDF main content, so I need to extract its main content, not just text, but also tables, diagrams, etc. with their layout information. I'm only interested in the main part of the content, for example, for technical paper, I'm…
Yu Shen
  • 2,770
  • 3
  • 33
  • 48
1
vote
1 answer

split pdf to multiple html file with pdf2htmlEX

I'm trying to split a PDF file into separate HTML files. I mean for each PDF page I want an HTML file. This is how I do it: pdf2htmlEX --split-pages 1 LMS.pdf --page-filename lms%03.html In the result I got an empty LMS.html and other files:…
HamidIng
  • 105
  • 13
0
votes
0 answers

How to identify the modified content in a pdf file?

I have a PDF file for which I can see the creation time and the modification time. Is there a way to know from the metadata which parts (e.g. tables/figures/text) were modified? In other words, how could I identify the difference between the initial pdf…
Syhaa
  • 1
0
votes
0 answers

pdf2HtmlEX process PDF coredump

I use the following command to transform a PDF to HTML, and then I get a core dump file: ./pdf2htmlEX --zoom 1 --dest-dir ./pdf_test --optimize-text 1 --zoom 1.4 --process-outline 0 --embed-image 0 --font-format ttf pdf_test/020616320411_2.pdf [coredump message is…
0
votes
0 answers

Using co-ordinates in XML generated by poppler to build an email template

Generated a 72 dpi image and XML with zoom as 1 from this PDF. Although the DPI was 72, to be able to make the conversion of co-ordinates in the XML to pixel possible had to iteratively tweak the DPI using this sheet. 90.5 seems to work well.…
qwertynik
  • 118
  • 2
  • 10
0
votes
1 answer

Convert PDF to HTML without losing any format

I'm developing a Python Flask webapp and I'm trying to convert some user uploaded pdfs to nicely formatted HTML, like the HTML that is being produced when you display a pdf inside an iframe. I tried several things so far: the pdfminer.six library,…
robo-monk
  • 134
  • 3
  • 9
0
votes
1 answer

Pdf2htmlEX common error "Cannot load font"

Running the pdf2htmlEX.exe Windows binary from the command prompt works as expected. However, when running the pdf2htmlEX Windows binary in a wrapper (.NET in my case), I received an error like the one below. __tmp_font1.ttf is not in a known format (or…
Bernesto
  • 1,368
  • 17
  • 19
0
votes
1 answer

Pdf2Html Installation

I'm trying to install the pdf2htmlEX software on Ubuntu Server 18.04.1 LTS. The repository is not maintained, but the software is very useful for me. I installed it on a Xubuntu desktop distro and in a Docker image, but I can't do it on Ubuntu Server. It…
Agus Trombotto
  • 117
  • 2
  • 7
0
votes
2 answers

Install pdf2htmlEX on heroku

I used this Aptfile: fonts-liberation libreoffice-base-core libreoffice-calc libreoffice-writer libreoffice libpython2.7 pdf2htmlex poppler-utils And installation completed successfully. I even checked version of pdf2htmlEX in heroku…
0
votes
0 answers

running Pdf2htmlEX on linux using php

Kindly I request your help on the following issue: I am using pdf2htmlEX to convert my pdf files to HTML. The tool is working perfectly in WAMP; however, when I implement it on my Linux server, the tool is not working. My php code:
0
votes
0 answers

pdfminer when I am trying to run pdf2txt.py not working in windows

I have installed pdfminer, and when I try to run pdf2txt.py test.pdf -t html -o test.html on Windows, no error is shown and the command does not execute either. Please help me: how can I convert true PDF files to HTML files? Thanks.
0
votes
1 answer

pdf2htmlEX's output shows Times New Roman font for only a few characters?

I have never seen anything like this. I use a tool called pdf2htmlEX, which converts a PDF to HTML, but I have a weird issue. Look at this screenshot: See the first character (W)? It's in Times New Roman. Now here's the even more weird part: Only…
MortenMoulder
  • 6,138
  • 11
  • 60
  • 116
0
votes
1 answer

Pdf2htmlEx: The html size converted by pdf is very large?

I convert PDF to HTML via pdf2htmlEX. The source PDF file is 21MB and the converted HTML is nearly 900MB. Conversion command: pdf2htmlEX --no-drm 0 --embed-image 1 --dest-dir ./output09 ./b.pdf ./b.html Is there any way to reduce the size of the output HTML?
charisMao
  • 99
  • 14
0
votes
2 answers

Getting text location from pdf

I want to know the location of all the words in a PDF page. I have been trying to find something on the web but couldn't. Can anyone suggest which library (preferably on the Java platform) I should use?
Prabhjot Rai
  • 27
  • 1
  • 3