Why is my Tika Metadata object not being populated when using ForkParser?

Question

ForkParser is a parser introduced in Tika 0.9, located in the org.apache.tika.fork package. It forks a new JVM process to analyze the passed file stream. I figured this would be a good way to cap how much memory I devote to Tika's metadata extraction. However, the Metadata object is not populated with the metadata properties it would receive when using an AutoDetectParser. My tests show that the BodyContentHandler object is not null.

Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?

public static Metadata getMetadata(File f) {
    Metadata metadata = new Metadata();
    ForkParser parser = new ForkParser();
    FileInputStream fis = null;
    try {
        fis = new FileInputStream(f);
        // A write limit of -1 disables the cap on extracted body text.
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();

        // Command used to launch the forked JVM; -Xmx64m caps its heap at 64 MB.
        parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());

        parser.parse(fis, contentHandler, metadata, context);

        String contentType = metadata.get(Metadata.CONTENT_TYPE);

        logger.error("contentHandler: " + contentHandler.toString());
        logger.error("metadata: " + metadata.toString());

        return metadata;

    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
        "CAUTION: metadata may still have useful content in it!\n" +
        "Exception: " + e, e);

        return metadata;
    } finally {
        // Release the stream and shut down the forked JVM even when parsing fails.
        if (fis != null) {
            try {
                fis.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing the stream fails.
            }
        }
        parser.close();
    }
}

score 3 · Accepted Answer · answered Dec 02 '11 at 09:44

Unfortunately, the ForkParser class in Tika 1.0 does not support metadata extraction: for now, the communication channel to the forked parser process only passes back SAX events, not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.

One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
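A minimal sketch of that workaround, not tested against ForkParser itself. Tika does not return the XHTML document from parse(); it delivers it as SAX events to the ContentHandler you pass in. BodyContentHandler keeps only the body text, so reading the <meta> tags needs a handler that looks at the <head> events too. The MetaTagHandler and MetaTagSketch names below are hypothetical, and the sample XHTML string is a hand-written stand-in for what the forked parser would emit. The sketch uses only the JDK's SAX classes so it runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records <meta name="..." content="..."/> elements seen in the SAX stream. */
class MetaTagHandler extends DefaultHandler {
    private final Map<String, String> meta = new LinkedHashMap<String, String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the event source is not namespace-aware, so fall back to qName.
        String element = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("meta".equals(element)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                meta.put(name, content);
            }
        }
    }

    Map<String, String> getMeta() {
        return meta;
    }
}

class MetaTagSketch {
    /** Replays an XHTML string through the handler; this stands in for the forked parser's SAX output. */
    static Map<String, String> extractMeta(String xhtml) {
        MetaTagHandler handler = new MetaTagHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler.getMeta();
    }

    public static void main(String[] args) {
        // Assumed sample input, not real Tika output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "<title>sample</title></head><body><p>text</p></body></html>";
        System.out.println(extractMeta(sample));
    }
}
```

With Tika, the idea would be to pass such a handler to parser.parse(...) in place of the BodyContentHandler, or to combine the two with org.apache.tika.sax.TeeContentHandler so the body text is still captured. Whether every metadata entry shows up as a <meta> tag depends on the underlying parser, so treat the resulting map as a best-effort substitute for the Metadata object.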

Jukka Zitting

Thanks Jukka, I filed a Tika improvement issue as you suggested. I will look into interrogating the XHTML document produced by the parse method, but I still need to figure out how to do that. Where does the XHTML document get returned from? The parse method is void and returns nothing. Is the document in the content handler or in one of the other objects passed to the parse method? – anchovie Dec 03 '11 at 02:28
