I use jsoup to scrape HTML. I am having trouble extracting information from HTML tags of the following kind:

<span class="some">&#8237;&#8237;78&#8236;&#8236;</span>

It should only be:

<span class="some">78</span>

How can I remove the HTML entities from the string?

Patru

1 Answer

I'm not familiar with jsoup, but if it is a "normal" HTML DOM parser that returns a "standard" HTML DOM, then what you want is not really possible. The problem is that once the DOM has been built, it can no longer distinguish between characters that were written literally and characters that were expressed as entities.

For example, <span>A</span> and <span>&#65;</span> are considered completely identical and can't be distinguished once they are in the DOM: both are span elements containing a text node with the text A.
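
To make that equivalence concrete, here is a minimal sketch in plain Java (no jsoup involved). The class NumericEntityDemo and the helper decodeDecimalReference are hypothetical names made up for illustration; they only mimic what a parser does when it resolves a decimal character reference, and are not how jsoup implements it.

```java
public class NumericEntityDemo {
    // Hypothetical helper (not part of jsoup): decodes a decimal numeric
    // character reference such as "&#65;" into the character it stands for.
    public static String decodeDecimalReference(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("Not a decimal character reference: " + reference);
        }
        // Strip the leading "&#" and the trailing ";" and parse what is left.
        int codePoint = Integer.parseInt(reference.substring(2, reference.length() - 1));
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String raw = "A";
        String decoded = decodeDecimalReference("&#65;");
        // After decoding, nothing records which spelling the source used.
        System.out.println(raw.equals(decoded)); // true

        // Decimal 8237 is the code point U+202D.
        int codePoint = decodeDecimalReference("&#8237;").codePointAt(0);
        System.out.println(Integer.toHexString(codePoint)); // 202d
    }
}
```

Once both spellings have collapsed into the same Java string, any later processing has to work on the characters themselves.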

So what you can do is loop over all text nodes and search for and replace these characters (the characters themselves, not the entities):

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Recursively strips U+202C and U+202D from every text node below the given element.
void removeInvalidChars(Element element) {
  for (Node child : element.childNodes()) {
    if (child instanceof TextNode) {
      TextNode textNode = (TextNode) child;
      // U+202C and U+202D are the hex forms of the decimal values 8236 and 8237
      textNode.text(textNode.text().replace("\u202C", "").replace("\u202D", ""));
    } else if (child instanceof Element) {
      removeInvalidChars((Element) child);
    }
  }
}
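
If you only need the cleaned text rather than a cleaned DOM, the same idea can be applied to a string you have already extracted. The following is a rough sketch using only the Java standard library. The class name BidiControlStripper and the choice to strip the wider set of bidirectional formatting characters (not just U+202C and U+202D) are my assumptions, not something the question asks for; narrow the character class if you only want those two.

```java
import java.util.regex.Pattern;

public class BidiControlStripper {
    // Assumed character set: explicit bidirectional embeddings/overrides
    // (U+202A..U+202E), isolates (U+2066..U+2069) and the LRM/RLM marks
    // (U+200E, U+200F). The escapes are resolved by the regex engine.
    private static final Pattern BIDI_CONTROLS =
            Pattern.compile("[\\u202A-\\u202E\\u2066-\\u2069\\u200E\\u200F]");

    // Returns the input with every matching formatting character removed.
    public static String stripBidiControls(String text) {
        return BIDI_CONTROLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The decoded text content of the span from the question.
        String scraped = "\u202D\u202D78\u202C\u202C";
        String cleaned = stripBidiControls(scraped);
        System.out.println(cleaned);          // 78
        System.out.println(cleaned.length()); // 2
    }
}
```

With jsoup you would presumably pass the text you read from the element into stripBidiControls; that wiring is left out here so the example stays free of external dependencies.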

If you need to distinguish between raw characters and entities, then you'll need to use a different non-DOM (e.g. event-based) HTML parser.

RoToRa