Getting a strange error from Watson's Document Conversion service

Question

I am trying to convert some documents into answer units with Watson's Document Conversion service, using the watson-developer-cloud JavaScript library in Node.js. Certain documents (an example is at IBM internal link and is a .DOCX file) return this error:

Error: code:400 error: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

If I convert the same file via the document conversion demo site, it converts without error. My program downloads the file from the source, writes it to disk, and then uploads it to the Document Conversion service via the above-mentioned library.

Is there any way around this error? Consider that this conversion is part of a massive automated conversion of thousands of documents, so manual handling for these outliers is out of the question.

Here is a post from someone who got the same error message from Microsoft Excel. http://stackoverflow.com/questions/12593752/why-do-i-failed-to-read-excel-2007-using-poi. I'm not sure how much that helps, but it might point you in the right direction. Please remove the link to the IBM internal doc, as your question is really about doc formats and not doc content. If you need to share details on IBM internal docs, the place to do it is an IBM internal forum. — ralphearle, Nov 07 '16 at 16:58

score 1 · Accepted Answer · answered Nov 08 '16 at 17:55

The service attempts to autodetect the media type of the uploaded file from the first few bytes of the file and from the file name.
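To see which container format a downloaded file really has before uploading it, you can inspect its leading bytes yourself. The sketch below is not the service's detection logic, only an illustration of byte sniffing: legacy Office files (.doc, .xls) start with the OLE2 signature, while Office 2007+ files (.docx, .xlsx) are ZIP archives and start with `PK\x03\x04`. The helper names `sniffContainer` and `sniffFile` are made up for this example.

```javascript
const fs = require('fs');

// Well-known file signatures (magic numbers).
const OLE2_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// Classify a buffer holding the first bytes of a file.
// Returns 'ole2' (legacy Office), 'zip' (Office 2007+ and other ZIPs), or 'unknown'.
function sniffContainer(header) {
  if (header.length >= OLE2_MAGIC.length &&
      header.subarray(0, OLE2_MAGIC.length).equals(OLE2_MAGIC)) {
    return 'ole2';
  }
  if (header.length >= ZIP_MAGIC.length &&
      header.subarray(0, ZIP_MAGIC.length).equals(ZIP_MAGIC)) {
    return 'zip';
  }
  return 'unknown';
}

// Read only the first 8 bytes of a file on disk and classify them.
function sniffFile(filename) {
  const fd = fs.openSync(filename, 'r');
  try {
    const header = Buffer.alloc(8);
    const bytesRead = fs.readSync(fd, header, 0, header.length, 0);
    return sniffContainer(header.subarray(0, bytesRead));
  } finally {
    fs.closeSync(fd);
  }
}
```

If `sniffFile` reports `'zip'` for a file that was saved with a `.doc` name (or no extension at all), the name and the content disagree, which is one plausible way to end up with the error quoted in the question.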

If the file name is unavailable (i.e., not passed in by your user), you could provide the media type of the file you are uploading in the file portion of the convert call:

file: {
    value: fs.createReadStream('filename'),
    options: {
        // Note the hyphen between "openxmlformats" and "officedocument".
        contentType: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    }
}
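For a batch job, the `file` part can be built by a small helper instead of being hard-coded per document. This is a minimal sketch under a few assumptions: the lookup table and the helper names `mediaTypeFor` and `buildFilePart` are hypothetical, the local path is assumed to carry a trustworthy extension, and the option key is left configurable because a comment on this answer reports that `content_type` worked where `contentType` did not.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical lookup table; extend it for the formats in your corpus.
const MEDIA_TYPES = {
  '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  '.doc': 'application/msword',
  '.pdf': 'application/pdf',
  '.html': 'text/html'
};

// Map a file name to a media type by extension (case-insensitive).
// Returns null when the extension is missing or not in the table.
function mediaTypeFor(filename) {
  return MEDIA_TYPES[path.extname(filename).toLowerCase()] || null;
}

// Build the `file` part of the convert call.
// `typeKey` is an assumption: pass 'content_type' if your library
// version ignores 'contentType'.
function buildFilePart(filename, typeKey = 'contentType') {
  const part = { value: fs.createReadStream(filename) };
  const mediaType = mediaTypeFor(filename);
  if (mediaType) {
    part.options = { [typeKey]: mediaType };
  }
  return part;
}
```

The object returned by `buildFilePart('some-file.docx')` would take the place of the `file: { ... }` value in the snippet above. When no media type is known, the helper sends no `options` at all and leaves detection to the service.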

"contentType" doesn't seem to work but "content_type" does. — David Powell, Nov 11 '16 at 22:44
