I have the following test, which checks the content type detected for a .docx file:

@Test
public void testContentTypeOfaWordDOCXFileIsReturnedCorrectlyByTheServer() throws IOException, TikaException {
    File docxFile = new File(FILE_COMPLETE_PATH);
    InputStream inputStream = new FileInputStream(docxFile);
    MediaType mediaType = spyServlet.getServerInducedType(inputStream);

    assertEquals(DOCX_TYPE, mediaType);
}

getServerInducedType is implemented as follows:

protected MediaType getServerInducedType(InputStream inputStream) throws IOException, TikaException {
    try (BufferedInputStream buffStream = new BufferedInputStream(inputStream);
         TikaInputStream tikaInputStream = TikaInputStream.get(buffStream)) {
        TikaConfig tikaConfig = new TikaConfig();
        Detector detector = tikaConfig.getDetector();
        Metadata metadata = new Metadata();
        return detector.detect(tikaInputStream, metadata);
    }
}

Question: When I run the test above, I expect to get DOCX_TYPE, which is "application/x-tika-ooxml", but I get "application/zip" instead. Why?

P.S. I have no tika.config file and no TIKA_CONFIG environment variable set (see here).

I also added the tika-parsers and tika-core dependencies to the POM file (see here).

This is the output that I get:

java.lang.AssertionError:
Expected :application/x-tika-ooxml
Actual   :application/zip

I also tested it with a JPG file, and Tika detects it correctly as image/jpeg.

My POM file has the following configuration:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.9</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.9</version>
</dependency>
qartal
  • `.docx` documents are actually `.zip` archives containing XML files. If the program only analyses the zip signature, it detects a standard zip. When you detect a standard zip, scan the archive's entry names and look for `[Content_Types].xml`. If you find it, the file is an Office Open XML package, so it is very likely a docx (or another Office format such as xlsx or pptx). It is certainly not a single XML file but rather a collection of XML files in a .zip. – Jean-François Fabre Aug 23 '16 at 16:47
  • Try doing `metadata.set(Metadata.RESOURCE_NAME_KEY, FILE_COMPLETE_PATH);` before calling `detector.detect`. – Siguza Aug 23 '16 at 16:47
  • I use the method getServerInducedType(InputStream inputStream) to read from a stream; in production, the stream comes from an HTTP request. The test above only checks whether Tika identifies the docx file correctly. So adding the "metadata.set(..)" call is unfortunately not applicable. – qartal Aug 23 '16 at 17:08
  • @Jean-FrançoisFabre, thanks for the comment. Could you post your suggestion as an answer and elaborate on it? – qartal Aug 23 '16 at 17:14
  • Why are you using such an old version of Apache Tika? What happens when you upgrade? – Gagravarr Aug 23 '16 at 19:14
  • @Gagravarr: the reason is that this version was already in the project's POM file. Good point, especially the "what happens" part :) I will try a newer version to see whether I get a better result. – qartal Aug 25 '16 at 00:44

2 Answers

I'm converting my comment into an answer because the OP requested it, even though it only partially answers the question.

.docx documents are actually .zip archives containing XML files in a fixed layout.

You can see this for yourself by opening a .docx file with 7-Zip.

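If a program only analyses the zip signature, it detects a standard zip. When that happens, scan the archive's entry names and look for [Content_Types].xml.

If you find it, the file is an Office Open XML package (.docx, .xlsx, or .pptx). An entry such as word/document.xml, where the main part of a Word document is typically stored, then tells you it is a docx.

It is certainly not a single XML file but rather a collection of XML files in a .zip.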

Microsoft describes the contents file by file here.

An Open Office XML document seems to be a single XML file rather than an archive. That's why I fail to see how Microsoft conforms to the Open Office standards. Beats me.

But as for the question "how to detect docx", my answer lets you do that: you "just" have to add extra code that opens the zip file and checks for distinctive file and directory names.
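
The entry-name check can be sketched in plain Java with `java.util.zip`, independently of Tika. This is only a minimal sketch under a few assumptions: the class name `DocxSniffer` and the helpers `looksLikeDocx` and `zipOf` are made up for illustration, the content-types part is named `[Content_Types].xml` as in OOXML packages, and the main Word part is assumed to sit at its conventional location `word/document.xml` (a package could place it elsewhere).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Tika or of the question's code.
class DocxSniffer {

    /**
     * Returns true if the stream is a zip archive that contains both
     * "[Content_Types].xml" and "word/document.xml" (assumed conventional
     * location of the main Word part). Consumes and closes the stream.
     */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        boolean hasContentTypes = false;
        boolean hasWordDocument = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null at the end of the archive,
            // and also when the data does not start with a zip entry.
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.equals("word/document.xml")) {
                    hasWordDocument = true;
                }
                if (hasContentTypes && hasWordDocument) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Builds a small in-memory zip with the given entry names (demo data only). */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docxLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipOf("notes.txt");
        System.out.println(looksLikeDocx(new ByteArrayInputStream(docxLike)));
        System.out.println(looksLikeDocx(new ByteArrayInputStream(plainZip)));
    }
}
```

Note that the check reads the stream to the end in the worst case, so with a one-shot stream such as an HTTP request body you would have to buffer the bytes first if something else needs to read them afterwards.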

Jean-François Fabre

A docx is a zip; change the extension to .zip and open it to convince yourself.

Tika may be expecting to be pointed at the actual OOXML file inside the archive.