I'm trying to get the bounding rect of each word in an HTML document.

For example,

<html><body>Lorem ipsum dolor</body></html>

I want the output as [x, y, width, height] - word

[ 8, 8, 44.671875, 19 ] - Lorem
[ 56.5, 8, 43.125, 19 ] - ipsum
[ 103.4, 8, 35.02, 19 ] - dolor

I'm using the Chrome DevTools Protocol (CDP) to get a DOMSnapshot, which gives the bounding rect for a line as a whole and not for individual words. (my-source-code)

[ 8, 8, 130.46875, 19 ] Lorem ipsum dolor

If I wrap every word in the HTML with a span tag, Chromium provides the desired result. But this solution seems hacky. Is there a better way to do this?

Note:

  1. The text content can have styles and fonts associated with it, so precomputing the width of each character is not an option.
  2. I can rasterize the page to a PDF using CDP and get a word iterator with Foxit or similar libraries, but I'd prefer to do everything with NodeJS.
XOR

1 Answer

CDP works at the level of nodes. In the example you've given, there is a single text node as a child of the body node, and the value of that text node is "Lorem ipsum dolor". However, if we have the following HTML:

<html>
<body>
<a>Lorem</a>
" "
<b>ipsum</b>
" "
<c>dolor</c>
</body>
</html>

Here each word lives in its own text node, so we can separate the words by their nodes. You could technically look for text nodes and add extra nodes to get the same effect, but this will make the process a lot heavier.
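As a rough illustration of the "add extra nodes" idea, the value of a text node could be split into word and whitespace runs, with each word wrapped in its own element. This is only a minimal sketch on plain strings, with no CDP calls; `escapeHtml`, `splitWords` and `wrapWords` are hypothetical helper names, and a real implementation would have to operate on the DOM nodes in the page.

```javascript
// Hypothetical helper: escape characters that are special in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: split a text node's value into word and whitespace runs.
// The capture group in the regex keeps the whitespace separators in the result.
function splitWords(text) {
  return text.split(/(\s+)/).filter((part) => part.length > 0);
}

// Hypothetical helper: wrap every word in a <span>, leaving whitespace as-is,
// so that each word ends up in a node of its own.
function wrapWords(text) {
  return splitWords(text)
    .map((part) => (/^\s+$/.test(part) ? part : `<span>${escapeHtml(part)}</span>`))
    .join("");
}

console.log(wrapWords("Lorem ipsum dolor"));
// prints: <span>Lorem</span> <span>ipsum</span> <span>dolor</span>
```

Doing this for every text node on a page is exactly the extra work mentioned above: the markup grows by one element per word, and the page has to be modified before the snapshot is taken.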

To sum up: since a single text node can hold multiple words, we cannot get the coordinates (or bounding rects) of each word in that node separately using CDP, at least not without heavily interfering with the page.

Kle0s