How to use lxml to parse xml extract of pymupdf?

Question

I read each page of a PDF with `Page.get_text("xml")` and appended every XML extract to a single string variable. The resulting text consists of many units like this:

<page id="page0" width="595.276" height="841.89">\n<block bbox="84.95639 235.90979 382.4564 316.3398">\n<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">\n<font name="AkkuratPro-Bold" size="35">

I understand that these are bounding boxes around the text, and the documentation says this output is best parsed with lxml. So I tried the following:

from lxml import etree

root = etree.fromstring(texts)

and got the following error:

Traceback (most recent call last):

  File "C:\Users\z34534534\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-11-209409d1172d>", line 3, in <module>
    root = etree.fromstring(texts)

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 196
XMLSyntaxError: Extra content at the end of the document, line 196, column 2

I would really like to know the correct way of using lxml and the bounding boxes to get the text out of the PDF document.

First of all, make sure that the `texts` variable contains valid XML. - LMC, Aug 03 '21 at 20:13
It worked with a single page: `doc = etree.fromstring(page.get_text('xml'))` - LMC, Aug 03 '21 at 20:28
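
Building on the comments above: an XML document may have only one root element, so the error is consistent with `texts` holding several `<page>` elements back to back, where everything after the first closing `</page>` counts as extra content. A minimal sketch of parsing each page on its own follows. The sample strings are hypothetical stand-ins modelled on the snippet in the question, and the `char` elements with a `c` attribute are an assumption about the rest of the output, so check them against your real extract. The `extract_lines` helper is made up for illustration.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml installed
    import xml.etree.ElementTree as etree

# Hypothetical stand-ins for one XML string per page. With PyMuPDF this list
# would presumably come from: [page.get_text("xml") for page in doc]
page_xmls = [
    '<page id="page0" width="595.276" height="841.89">\n'
    '<block bbox="84.95639 235.90979 382.4564 316.3398">\n'
    '<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">\n'
    '<font name="AkkuratPro-Bold" size="35">\n'
    '<char x="84.96" y="270.0" c="H"/>\n'
    '<char x="110.0" y="270.0" c="i"/>\n'
    '</font>\n</line>\n</block>\n</page>\n',
    '<page id="page1" width="595.276" height="841.89">\n'
    '<block bbox="50 60 200 80">\n'
    '<line bbox="50 60 200 80" wmode="0" dir="1 0">\n'
    '<font name="AkkuratPro-Bold" size="12">\n'
    '<char x="50" y="75" c="O"/>\n'
    '<char x="58" y="75" c="k"/>\n'
    '</font>\n</line>\n</block>\n</page>\n',
]


def extract_lines(page_xml):
    """Return a list of (bbox, text) tuples, one per <line> element of a page."""
    root = etree.fromstring(page_xml)  # exactly one <page> root per string
    results = []
    for line in root.iter("line"):
        # bbox is stored as four space-separated numbers: x0 y0 x1 y1
        bbox = tuple(float(v) for v in line.get("bbox").split())
        # Assumption: each character is a <char> element with its glyph in "c"
        text = "".join(ch.get("c", "") for ch in line.iter("char"))
        results.append((bbox, text))
    return results


# Parse page by page instead of concatenating everything into one string
all_lines = [extract_lines(xml) for xml in page_xmls]
```

Here `all_lines[0]` holds the lines of the first page, each paired with its bounding box. Wrapping the concatenated string in one made-up root element, for example `<document>` ... `</document>`, should also give the parser a single root, but keeping the pages separate makes it easier to tell which page a line came from.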
