I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.
For example, given this HTML markup (with \n EOL characters):
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
<title>No Longer Human</title>
<meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
<link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
</head>
<body class="calibre" aid="0">
</body>
</html>
(example with BeautifulSoup, but I'm not attached to any parser in particular)
>>> soup = bs4.BeautifulSoup(html_markup, 'html.parser')
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag) # <-------- how do I go about doing this?
(104, 134) # <----- source mapping info I want to get
>>> html_markup[104:134]
'<title>No Longer Human</title>'
I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?
I realize that str(soup_element) serializes the element back into markup (and I could hypothetically recurse down the tree, saving start and end indices as I go), but the markup that produces, although semantically equivalent to the original, doesn't match it char-for-char: none of the available Python parsers round-trip the markup exactly.
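To illustrate the kind of hack I have in mind (I'm not sure it's robust): the stdlib html.parser reports a (line, column) position via getpos() while a handler is running, and that can be mapped back to a flat offset into the source string. A sketch (the class name is mine, and it assumes the whole document is fed in a single feed() call with \n line endings):

```python
from html.parser import HTMLParser

class OffsetTrackingParser(HTMLParser):
    """Record the flat character offset at which each start tag begins.

    Hypothetical sketch: assumes the entire document is passed to
    feed() in one call, with '\n' line endings.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tag_offsets = []    # list of (tag_name, start_offset)
        self._line_starts = [0]

    def feed(self, data):
        # Precompute the absolute offset of each line start so that
        # getpos()'s (line, column) pair maps to a flat string index.
        offset = 0
        self._line_starts = [0]
        for line in data.split("\n"):
            offset += len(line) + 1
            self._line_starts.append(offset)
        super().feed(data)

    def handle_starttag(self, tag, attrs):
        # While this handler runs, getpos() still points at the '<'
        # that opened the tag (1-based line, 0-based column).
        line, col = self.getpos()
        self.tag_offsets.append((tag, self._line_starts[line - 1] + col))


doc = "<html>\n<head>\n<title>No Longer Human</title>\n</head>\n</html>"
p = OffsetTrackingParser()
p.feed(doc)
offsets = dict(p.tag_offsets)
doc[offsets["title"]:]  # starts with '<title>'
```

Getting the matching end offsets would presumably need the same trick in handle_endtag, and this is a separate event-based pass rather than something attached to the DOM tree, which is part of why I'm asking whether there's a better approach.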