Remove html entities with jsoup in android

Question

I use jsoup to scrape HTML. I am having trouble extracting the text from HTML tags like this one:

<span class="some">&#8237;&#8237;78&#8236;&#8236;</span>

It should just be:

<span class="some">78</span>

How can I remove the HTML Entities from the string?

Yes, I need to get the text without the HTML entities. When I extract it with jsoup, they are converted to spaces. — user3568736, Apr 24 '14 at 12:25
Do you want something like this? http://stackoverflow.com/questions/17643512/android-string-encoding-and-html-entities-converting — Robi Kumar Tomar, Apr 24 '14 at 12:26
Potentially any character could be expressed as an entity. It seems dodgy to simply discard every entity. — njzk2, Apr 24 '14 at 13:37

score 0 · Answer 1 · answered Apr 24 '14 at 14:26

I'm not familiar with jsoup, but if it is a "normal" HTML DOM parser that returns a "standard" HTML DOM, then what you want is not really possible. The problem is that once the DOM has been built, it can no longer distinguish between characters that were written literally and ones that were expressed as entities.

For example, <span>A</span> and <span>&#65;</span> are considered completely identical and can't be distinguished once in the DOM: both are span elements containing a text node with the text A.

So what you can do is loop over all text nodes and search for and replace these characters (the characters themselves, not the entities):

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Recursively strips U+202C and U+202D from every text node below the given element.
void removeInvalidChars(Element element) {
  for (Node child : element.childNodes()) {
    if (child instanceof TextNode) {
      TextNode textNode = (TextNode) child;
      // 202C and 202D are the hex codes for the decimal values 8236 and 8237
      textNode.text(textNode.text().replace("\u202C", "").replace("\u202D", ""));
    } else if (child instanceof Element) {
      removeInvalidChars((Element) child);
    }
  }
}
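The same replacement can be tried on a plain string, without jsoup, to check that it does what you expect. The following is only a sketch: the class name BidiControls and the method strip are made up for illustration, and it assumes you want to drop the whole family of bidirectional formatting controls (U+202A to U+202E, U+2066 to U+2069, U+200E and U+200F) rather than just the two characters from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: removes bidirectional control characters.
final class BidiControls {
    // U+202A..U+202E are the embedding/override controls (8236 and 8237 fall in this range),
    // U+2066..U+2069 are the isolate controls, U+200E/U+200F are the LRM/RLM marks.
    private static final Pattern BIDI =
        Pattern.compile("[\\u202A-\\u202E\\u2066-\\u2069\\u200E\\u200F]");

    static String strip(String text) {
        return BIDI.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The text content of the span from the question, after entity decoding.
        String raw = "\u202D\u202D78\u202C\u202C";
        String clean = strip(raw);
        System.out.println(clean + " (" + clean.length() + " chars)"); // prints: 78 (2 chars)
    }
}
```

Inside the loop above, the two chained replace calls could then be swapped for a single call to such a helper.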

If you need to distinguish between raw characters and entities, then you'll need to use a different non-DOM (e.g. event-based) HTML parser.
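If a full event-based parser is more than you need, one cruder alternative is to filter the raw HTML string before it is handed to the DOM parser, since at that point the entities are still visible as text. This is a different technique from the one described above, shown only as a sketch: the class name EntityPreFilter and the method dropBidiEntities are hypothetical, and it assumes only the numeric references for U+202A to U+202E should be removed. Being a plain regex, it also matches inside comments and scripts.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter: works on the HTML source, before any DOM is built.
final class EntityPreFilter {
    // Matches decimal (&#8234; .. &#8238;) and hex (&#x202A; .. &#x202E;) numeric
    // character references; raw characters and all other entities are left alone.
    private static final Pattern BIDI_ENTITY =
        Pattern.compile("&#(?:823[4-8]|[xX]202[A-Ea-e]);");

    static String dropBidiEntities(String rawHtml) {
        return BIDI_ENTITY.matcher(rawHtml).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<span class=\"some\">&#8237;&#8237;78&#8236;&#8236;</span>";
        System.out.println(dropBidiEntities(html));
        // prints: <span class="some">78</span>
    }
}
```

The filtered string can then be parsed as usual.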
