6

When I use Apache Tika to determine the file type from the content, XML files are detected correctly but JSON files are not. If the content is JSON, it returns "text/plain" instead of "application/json".

Any help?

// DETECTOR is assumed to be a shared org.apache.tika.detect.Detector instance.
public static String tiKaDetectMimeType(final File file) throws IOException {
    // try-with-resources closes the stream even if detection throws
    try (TikaInputStream tikaIS = TikaInputStream.get(file)) {
        final Metadata metadata = new Metadata();
        return DETECTOR.detect(tikaIS, metadata).toString();
    }
}
songjing

2 Answers

7

JSON is based on plain text, so it's not altogether surprising that Tika reported it as such when given only the bytes to work with.

Your problem is that you didn't also supply the filename, so Tika had only the bytes to go on. If you had supplied it, Tika could have combined the two (bytes = plain text, filename = .json, therefore JSON) and given you the answer you expected.

The line you're missing is:

metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

So the fixed code snippet would be:

tikaIS = TikaInputStream.get(file);
final Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
return DETECTOR.detect(tikaIS, metadata).toString();

With that, you'll get back "application/json", as you were expecting.
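To make the "bytes + filename" reasoning concrete, here is a toy sketch of the idea. It is not Tika's actual implementation: the class name `NameHintSketch`, the `refine` method, and the small extension table are all made up for illustration, and it assumes the content-based guess is only refined when it is "text/plain".

```java
import java.util.Locale;
import java.util.Map;

/** Toy illustration only; this is NOT how Tika is implemented internally. */
class NameHintSketch {
    // Hypothetical extension table, just large enough for the example.
    private static final Map<String, String> TEXT_SUBTYPES_BY_EXTENSION = Map.of(
            "json", "application/json",
            "csv", "text/csv");

    /**
     * Refines a content-based guess using the file name as a hint.
     * Only "text/plain" is refined, because the bytes alone cannot tell
     * text-based formats apart; any other guess is returned unchanged.
     */
    static String refine(String contentGuess, String fileName) {
        if (!"text/plain".equals(contentGuess) || fileName == null) {
            return contentGuess;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return contentGuess; // no extension available as a hint
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TEXT_SUBTYPES_BY_EXTENSION.getOrDefault(ext, contentGuess);
    }
}
```

Without a name the guess stays "text/plain", which mirrors what the question's code observed.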

Gagravarr
  • Anyone have suggestions if you're not dealing with a file, or can't trust the file extension as a hint? – milletron Oct 11 '17 at 16:26
  • @milletron Pass Apache Tika the contents of the File, and it'll do mime magic based detection too – Gagravarr Oct 12 '17 at 00:39
  • Thanks @Gagravarr. Yes, I can tell the mime detection works overall with the dozen or so different byte streams I push through, but it still doesn't distinguish JSON from plain text (with 1.15 at least). I guess one would have to write a new Detector similar to the XML and HTML ones? I'm just surprised JSON isn't included already. – milletron Oct 12 '17 at 14:23
  • @milletron The JSON format doesn't have any well-known magic at the start, beyond looking for `{"` or `["` which isn't unique nor even always the case with json, it's tough... – Gagravarr Oct 13 '17 at 11:37
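A small sketch of the prefix check mentioned in the last comment shows why it is a weak signal. The class and method names are hypothetical, and this is only the naive heuristic described above, not anything Tika ships.

```java
/** Naive prefix heuristic, shown only to illustrate its limits. */
class PrefixHeuristicSketch {
    /** Returns true if the text, ignoring leading whitespace, starts with {" or [". */
    static boolean startsLikeJson(String text) {
        String trimmed = text.stripLeading(); // requires Java 11+
        return trimmed.startsWith("{\"") || trimmed.startsWith("[\"");
    }
}
```

The check accepts `{"a":1}`, but it misses valid documents such as `[1,2,3]` or `{ "a": 1 }` (whitespace after the brace), and it accepts truncated input such as `{"oops`, so it is neither necessary nor sufficient.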
0

For those not dealing with a file, I've found it easiest to just run the payload through Jackson to see whether it can be parsed. If Jackson can parse it, you know 1) you are working with JSON and 2) the JSON is valid.

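import java.io.IOException;
import com.fasterxml.jackson.databind.ObjectMapper;

// ObjectMapper is thread-safe once configured, so a single shared instance is fine.
private static final ObjectMapper MAPPER = new ObjectMapper();

public static boolean isValidJSON(final String json) {
    try {
        MAPPER.readTree(json);
        return true;
    } catch (IOException e) {
        // Jackson's parse exceptions extend IOException, so malformed input lands here.
        return false;
    }
}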
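One way to wire this into detection, as a rough sketch: treat the parse check as a second opinion that is consulted only when the detector says "text/plain". The class `JsonRefineSketch` and its `refine` method are hypothetical names, and the validator is passed in as a `Predicate` so the snippet itself has no Jackson dependency. Keep in mind that a bare number or quoted string is also valid JSON, so short payloads may be upgraded too.

```java
import java.util.function.Predicate;

/** Hypothetical helper that combines a detected type with a JSON validity check. */
class JsonRefineSketch {
    /**
     * Upgrades a "text/plain" result to "application/json" when the supplied
     * validator accepts the payload; every other result passes through unchanged.
     */
    static String refine(String detected, String payload, Predicate<String> isJson) {
        if ("text/plain".equals(detected) && payload != null && isJson.test(payload)) {
            return "application/json";
        }
        return detected;
    }
}
```

With the method above in a class called, say, `JsonUtil` (an assumed name), the call would look like `JsonRefineSketch.refine(mimeType, body, JsonUtil::isValidJSON)`.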
ptdunlap