
How can I retrieve the value of this <div> with data-postid 48227783 using Apache Tika?

<div class="postcolor post_text" data-postid="48227783">Ownage!<br /></div>

I am trying to retrieve the value 'Ownage!'. I tried using mapSafeElement and DefaultHtmlMapper, but I cannot find the element anywhere in the output.

Thanks.

Trinadh Gupta
akunyer

1 Answer


I would implement HtmlMapper and override its mapSafeElement, mapSafeAttribute and isDiscardElement methods so that this element is passed through during the parse, since Tika may be dropping the <div> and rejecting the non-standard (non-"safe") attribute "data-postid". The mapper class is shown below.

Then, you would use this class via the ParseContext object, as follows:

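// Classes used: org.apache.tika.parser.ParseContext, org.apache.tika.parser.html.HtmlParser,
// org.apache.tika.parser.html.HtmlMapper, org.apache.tika.sax.ToXMLContentHandler,
// org.apache.tika.metadata.Metadata, org.xml.sax.ContentHandler
InputStream input = <your URI/file/string input stream>;
ParseContext parseContext = new ParseContext();
parseContext.set(HtmlMapper.class, new AllTagMapper());

// ContentHandler is an interface, so pass a concrete handler; this one keeps tags and attributes.
ContentHandler handler = new ToXMLContentHandler();
HtmlParser parser = new HtmlParser();
parser.parse(input, handler, new Metadata(), parseContext);
String xhtml = handler.toString(); // serialized XHTML, including the div and its attributes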

// Implement HtmlMapper so that all tags and attributes are processed.

class AllTagMapper implements HtmlMapper {

    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase();
    }

    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase();
    }

}
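
To pull out just the text 'Ownage!' rather than the whole document, one option is a small SAX handler that collects the character data inside the div whose data-postid matches. The following is only a sketch: PostTextHandler and its extract helper are hypothetical names, not part of Tika. It assumes that, with a pass-through mapper in place, Tika reports the element as "div" and the attribute as "data-postid" (the handler checks both the local and the qualified name to be safe). The extract method uses the JDK's own SAX parser on well-formed XHTML purely so the example runs without Tika on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of the first div with a given data-postid. */
class PostTextHandler extends DefaultHandler {
    private final String postId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;          // > 0 while inside the matching div
    private boolean done = false;   // true once the matching div has been closed

    PostTextHandler(String postId) {
        this.postId = postId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {            // nested element inside the matching div
            depth++;
            return;
        }
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (!done && "div".equalsIgnoreCase(name)
                && postId.equals(attribute(atts, "data-postid"))) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            done = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Looks an attribute up by local name or qualified name, ignoring case. */
    private static String attribute(Attributes atts, String wanted) {
        for (int i = 0; i < atts.getLength(); i++) {
            if (wanted.equalsIgnoreCase(atts.getLocalName(i))
                    || wanted.equalsIgnoreCase(atts.getQName(i))) {
                return atts.getValue(i);
            }
        }
        return null;
    }

    String getText() {
        return text.toString().trim();
    }

    /** Demo only: runs the handler with the JDK SAX parser on well-formed XHTML. */
    static String extract(String xhtml, String postId) {
        PostTextHandler handler = new PostTextHandler(postId);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getText();
    }
}
```

With Tika, the same handler would be passed as the second argument to parser.parse(...) in place of the serializing handler, and getText() read afterwards. Whether the div actually reaches the handler still depends on the mapper registered in the ParseContext.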