
I need to convert a few large documents to a database and I have the files in xml, xhtml, epub and pdf.

Assuming the files themselves are completely faultless, which of these formats will enable me to extract the text with the least mistakes and missing elements?

I am guessing that pdf will likely be the worst performer (I remember seeing a table of text-extraction accuracy where the best library scored 98% and most were below that), but I included it in the list just in case I am mistaken.

Many thanks in advance!

Karl Knechtel
Olli
  • Thanks @KJ, I only have the above filetypes available, it is what is given to me. What is xmlAS.txt, xhtmlAS.txt=epubAStxt? – Olli Mar 27 '23 at 12:55
  • Thank you @KJ, ok, so from what I am understanding here, `xml` and `xhtml` should be the most useful formats? – Olli Mar 28 '23 at 10:00
  • 1
    This isn't well defined. "Loss" could mean many things. There could be varying amounts of errors in the data that have nothing to do with how it is structured. The actual formatting could have varying amounts of meaning to how the text should be interpreted. There are any number of contexts where it might be appropriate to do certain conversions (like Unicode normalization) or inappropriate to do the same conversion on the same text. – Karl Knechtel Mar 28 '23 at 15:35
  • 1
    That said: XML and XHTML are designed to store text and embed formatting commands within the text. epub is **functionally the same**; it is just a zip archive that contains XHTML. PDF, on the other hand, is designed to store graphics and use text as a graphical element. The ability to recover data from a PDF will depend on countless factors related to how the PDF was created - that ranges from "cross-compiled from a hand-coded Postscript file" to "output by a scanner's device driver without any attempt at OCR". – Karl Knechtel Mar 28 '23 at 15:37
  • @KarlKnechtel Thanks a lot, I understand this a lot better now. I think I'll give it a go with the XML and XHTML files; together with Michael's answer, I feel I have the best chance of not losing chunks of text. – Olli Mar 28 '23 at 15:41
  • At any rate, this is not a question about Python or any other programming language - unless you meant to ask a completely different question in the body, so as to match the title. The task that you describe is also not any kind of "text parsing"; that means taking data that is already text, and deducing a structure based on what the text says (for example, looking at computer code for a mathematical expression, and creating a syntax tree). – Karl Knechtel Mar 28 '23 at 15:41
  • @KarlKnechtel well, the question originally arose after I had tried various python packages (tika, textract, py2pdf, pydocx etc.) and found that they do not recover data perfectly (py2pdf readthedocs), so before spending more time on it, I wondered which document type would give me the least trouble using python. I find it a valid question and relevant because that is how I look for things and usually find answers here. – Olli Mar 28 '23 at 15:45
  • That's... really not how the site works, though. "Validity" isn't determined in terms of just making sense to the person asking and representing a real task; to remain open, questions need to be on-topic, objective, clear, focused, answerable and not duplicates; and ideally, researched. They aren't written to solve an individual's bespoke problem, but to contribute to a searchable library. – Karl Knechtel Mar 28 '23 at 15:51
  • @KarlKnechtel I didn't mean it this way, but in the way that I often find answers and questions of this type (though to be fair, I also often see them being closed or removed) and find them most helpful, hence, I think it might be helpful if another person will find themselves in choosing from or export to any of the above mentioned filetypes, which I think is a fairly common situation, just doing a quick google search over SO. – Olli Mar 28 '23 at 16:14
  • I would like to add that I don't believe there is any such thing as a "bespoke" or unique problem. Otherwise sites like SO wouldn't be so useful. – Olli Mar 28 '23 at 16:17

1 Answer


The problem with PDF is that, at worst, it's just a bunch of individual characters placed at particular co-ordinates on a page. (I.e., words and lines of text are all in the eye of the beholder.) Now, the particular PDF files you have might be better behaved than that, but I don't know. In any case, PDF files are complex data structures, so parsing them is complex, and extracting the text is not straightforward.

Now, technically, an xml or xhtml file could be just as hairy as a PDF. (E.g., you could have an xml file that is just a list of elements like <letter loc="234,1743">A</letter>.) But in practice, they aren't. If you can look at an xml/xhtml file and see the text you're interested in, then it will probably be easy to extract it programmatically.
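For instance, Python's standard library is enough for the well-behaved case. The sketch below uses a made-up XHTML fragment for illustration; with a real file you would call `ET.parse("document.xhtml")` instead, and `itertext()` would still collect the character data in document order:

```python
import xml.etree.ElementTree as ET

# A tiny illustrative XHTML-like fragment; a real file would be loaded
# with ET.parse("document.xhtml") instead of ET.fromstring().
sample = """<html>
  <body>
    <h1>Chapter 1</h1>
    <p>First paragraph with <em>inline</em> markup.</p>
    <p>Second paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(sample)
# itertext() walks the whole tree and yields every piece of character
# data, so inline elements like <em> do not split or drop any text.
text = " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
print(text)  # Chapter 1 First paragraph with inline markup. Second paragraph.
```

Real XHTML files carry the `http://www.w3.org/1999/xhtml` namespace, which affects element lookups by tag name but not `itertext()`.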

Epub would be comparable to xml/xhtml in terms of losslessness, but might be a bit more complicated to deal with.
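Since an epub is essentially a zip archive of XHTML files, the extra complication is mostly just the unzipping. A minimal sketch (the in-memory archive here is a stand-in for a real epub, which would also contain a `mimetype` file and an OPF manifest):

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Build a tiny zip in memory as a stand-in for an epub; with a real file
# you would pass its path to zipfile.ZipFile() directly.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("OEBPS/ch1.xhtml",
               "<html><body><p>Hello from the epub.</p></body></html>")

# Extraction: walk the archive and pull the text out of every .xhtml entry.
pieces = []
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        if name.endswith(".xhtml"):
            root = ET.fromstring(z.read(name))
            pieces.extend(t.strip() for t in root.itertext() if t.strip())

text = " ".join(pieces)
print(text)  # Hello from the epub.
```

In a real epub you would read the OPF manifest to get the chapters in spine order rather than relying on `namelist()` order.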

It would probably be a good idea to find out how the documents were authored, and how the various formats were derived. (I.e., if the assumption that the files are faultless is incorrect, that might have a bigger effect on the choice of format to use.)

Michael Dyck
  • "Epub would be comparable to xml/xhtml in terms of losslessness" It should be identical, as it is simply a zip archive of xhtml. "if the assumption that the files are faultless is incorrect" - "faultless" might not be something that can coherently be assessed. – Karl Knechtel Mar 28 '23 at 15:38
  • Thanks a lot Michael, I think I will give it a go with XML and/or XHTML; I can read the text well and it's well structured. I think the files are faultless, probably originally ported using Word or some similar editor. – Olli Mar 28 '23 at 15:42
  • 1
    Epub accepts a particular *profile* of XHTML. It's possible that the process of fitting the document into that profile caused some loss, though probably not textual. – Michael Dyck Mar 28 '23 at 15:46