
I have a PDF document that I want to convert to XML or HTML.

The PDF contains some tables, and once it has been converted to XML or HTML I cannot tell which content is table data and which is ordinary text.

I want to extract the table data and store it in a database.

Can xpdf or MuPDF do this?

Thanks.

allen

2 Answers


PDF does not (in general) contain structural information about text. Text is just text; nothing in the file marks a piece of text as belonging to a table.

Therefore there is no reliable way for any PDF reading application to identify text as being part of a table. So MuPDF will not be able to tell you this.

You can, of course, attempt to apply heuristics yourself, identifying text in rows at the same vertical offset, and looking for text spaced horizontally at regular x offsets.
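A minimal sketch of such a heuristic, assuming you have already extracted each text fragment with its position. The input format `(text, x, y)`, the function names, and the tolerance values are all hypothetical and would need tuning for a real document. Here "regular x offsets" is interpreted as adjacent rows whose cells start at the same x positions.

```python
def group_rows(words, y_tol=2.0):
    """Group (text, x, y) fragments into rows sharing roughly the same y."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells from left to right.
    return [sorted(row["cells"]) for row in rows]


def find_table_rows(rows, x_tol=3.0, min_cols=2):
    """Keep rows whose cell x offsets line up with a neighbouring row."""
    def same_columns(a, b):
        return len(a) == len(b) and all(
            abs(xa - xb) <= x_tol for (xa, _), (xb, _) in zip(a, b)
        )

    table = []
    for i, row in enumerate(rows):
        if len(row) < min_cols:
            continue  # a single fragment on a line is treated as plain text
        prev_ok = i > 0 and same_columns(row, rows[i - 1])
        next_ok = i + 1 < len(rows) and same_columns(row, rows[i + 1])
        if prev_ok or next_ok:
            table.append([text for _, text in row])
    return table


# Made-up sample data: a heading, a three-column table, and a footer.
words = [
    ("Quarterly report", 50, 20),
    ("Name", 50, 100), ("Qty", 200, 100), ("Price", 300, 100),
    ("Apple", 50, 115.5), ("3", 200, 115), ("1.20", 300, 116),
    ("Pear", 51, 130), ("7", 201, 130), ("0.80", 300, 130),
    ("End of page", 50, 400),
]
table = find_table_rows(group_rows(words))
```

This will misfire on multi-column body text, merged cells, or empty cells, which is why it can only ever be a heuristic.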

KenS
  • To improve the chances of better export to HTML/XML, it might be worthwhile to make the document accessible; this would add a structure (pretty much HTML) to the document which then can be used. There are tools around, but there may still be some manual labor involved. If, however, the document already comes along with a structure, you should be able to get that information. – Max Wyss Jul 27 '15 at 08:31
  • I have converted the PDF to XML with [pdftohtml](http://pdftohtml.sourceforge.net/). I want to pick the table data out of that XML file using information such as the entry for **Name**, because **top** and **left** are the coordinates of the string **Name**. I need to distinguish the strings in the tables of **the PDF** from the other text. Is this idea OK? I got the **Gfx.cc** code from **xpdf**; it has a function named **opRectangle**, but I am not sure whether it has anything to do with **tables in the PDF**. – allen Jul 28 '15 at 10:16
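
As a rough illustration of the idea in the comment above, the following sketch reads `top` and `left` from XML text elements and groups strings that share the same `top` value. The sample XML is made up to resemble pdftohtml's XML output, and the function names are hypothetical; check the element and attribute names against your own file.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the layout of a real file may differ.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0"><b>Qty</b></text>
<text top="115" left="50" width="40" height="12" font="0">Apple</text>
</page>
</pdf2xml>"""


def read_words(xml_string):
    """Return (text, left, top) for every non-empty text element."""
    root = ET.fromstring(xml_string)
    words = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> children.
            content = "".join(node.itertext()).strip()
            if content:
                words.append(
                    (content, float(node.get("left")), float(node.get("top")))
                )
    return words


def rows_by_top(words):
    """Group strings with an identical top value, ordered left to right."""
    rows = {}
    for text, left, top in words:
        rows.setdefault(top, []).append((left, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(rows.items())]


words = read_words(SAMPLE_XML)
rows = rows_by_top(words)
```

Grouping on an exact `top` value is the simplest possible rule; real documents usually need a small tolerance, since text on the same visual line can differ by a pixel or two.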

You can look at Tabula, which is free: https://tabula.technology/

"A tool to liberate data tables locked inside PDF files".

It is a web application. You can install Tabula on a Linux or Windows machine and use it from other PCs.

Massimo