
I am trying to convert this document: http://www.redbooks.ibm.com/redpapers/pdfs/redp5213.pdf to JSON answer units, but it (and many similar documents) won't process through the service. If I try it through the demo page at https://document-conversion-demo.mybluemix.net/, it either returns the error 'Missing required parameters: either params.file or params.document_id must be specified' or returns a blank result. If I try it through the REST API via Node.js and watson-developer-cloud, it returns error code 400 along with the message 'The input document failed to be converted because Exception while converting PDF to HTML'. (Why it's trying to convert to HTML I have no clue: I've specified JSON answer units, and this code has worked fine with some other documents I've tried.)

Is there something unusual about these redpapers that I'm trying to convert, or is the document conversion service having issues?

David Powell

1 Answer


I downloaded that [Redpaper][1] to my laptop, then went to the Document Conversion Demo, clicked Choose your file and uploaded the PDF I had just downloaded and then clicked Answer units JSON as the desired output format. At first, I didn't see anything happen. Hitting the download icon to the right of Output document gave me the converted JSON output as a download and also filled it in on the web page. Reloading the page, I got the conversion to appear on the demo page without having to hit the download.

I'm a newbie to Node.js. I got the following code to work (based on Document Conversion via Node) using the current watson-developer-cloud package, which is version 1.8.0.

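var watson = require('watson-developer-cloud');
var fs = require('fs');

// Replace the placeholders with your Document Conversion service credentials.
var document_conversion = watson.document_conversion({
  username:     'username',
  password:     'password',
  version:      'v1',
  version_date: '2015-12-15'
});

// Stream the downloaded PDF from disk and request answer units as the output.
document_conversion.convert({
  file: fs.createReadStream('redp5213.pdf'),
  conversion_target: 'ANSWER_UNITS'
}, function (err, response) {
  if (err) {
    // A failed conversion (for example, the 400 error) is reported here.
    console.error(err);
  } else {
    // Pretty-print the converted JSON.
    console.log(JSON.stringify(response, null, 2));
  }
});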

The call took between ten and twenty seconds to run over coffee shop WiFi.
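If you want to see how long the call takes on your own connection, a small wrapper along these lines could time it. This is only a sketch: `timedConvert` and `fakeConvert` are names I made up for illustration, and `fakeConvert` is a stand-in so the snippet runs without credentials. It assumes nothing about the service beyond the `(params, callback)` shape used above.

```javascript
// Hypothetical helper: times any function that follows the
// (params, callback(err, response)) convention used by convert().
function timedConvert(convertFn, params, callback) {
  var start = Date.now();
  convertFn(params, function (err, response) {
    // Pass the elapsed milliseconds along as a third argument.
    callback(err, response, Date.now() - start);
  });
}

// Stand-in for the real service call so this sketch runs offline.
function fakeConvert(params, callback) {
  setTimeout(function () {
    callback(null, { answer_units: [] });
  }, 50);
}

timedConvert(fakeConvert, { conversion_target: 'ANSWER_UNITS' },
  function (err, response, elapsedMs) {
    if (err) {
      console.error(err);
    } else {
      console.log('Conversion finished in ' + elapsedMs + ' ms');
    }
  });
```

To time the real service, you would pass `document_conversion.convert.bind(document_conversion)` in place of `fakeConvert`, with the same parameters as in the code above.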

Oh, and I forgot to address your question, "Why [is it] trying to convert to HTML?" The Document Conversion service always converts the input to HTML and then to normalized HTML. For answer units or plain text, it takes an additional step of converting the normalized HTML to the requested format. That is why a failure in the PDF to HTML step can show up even though you asked for answer units. This is described in Document Conversion - Customizing (which strikes me as oddly out of the way for basic flow documentation).
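To make that flow concrete, here is a rough sketch of the stages as I understand them. It does not call the service: `conversionStages` is a hypothetical helper, the stage labels are just names for the steps described above, and I'm assuming the three target names are `NORMALIZED_HTML`, `NORMALIZED_TEXT` and `ANSWER_UNITS`.

```javascript
// Illustrative only: these labels mirror the flow described above;
// they are not calls into the Document Conversion service.
var TARGETS = ['NORMALIZED_HTML', 'NORMALIZED_TEXT', 'ANSWER_UNITS'];

// Hypothetical helper: lists the steps a document passes through
// on its way to the requested target.
function conversionStages(target) {
  if (TARGETS.indexOf(target) === -1) {
    throw new Error('Unknown conversion target: ' + target);
  }
  // Every request goes through HTML and then normalized HTML.
  var stages = ['HTML', 'NORMALIZED_HTML'];
  // Answer units and plain text need one more step after that.
  if (target !== 'NORMALIZED_HTML') {
    stages.push(target);
  }
  return stages;
}

console.log(conversionStages('ANSWER_UNITS').join(' -> '));
// prints "HTML -> NORMALIZED_HTML -> ANSWER_UNITS"
```

Since the first stage is always HTML, a PDF that can't be turned into HTML fails before the answer units step is ever reached.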

[1]: http://www.redbooks.ibm.com/redpapers/pdfs/redp5213.pdf Redpaper

Bruce Adams
  • Since you tried the same document on the demo site and it worked, I went back and tried it again myself. I repeated the exact same steps 5 times and got very odd and inconsistent results: the first two tries resulted in the same behavior that I described above. The third try, however, worked pretty much as you said. Two more tries after that also seemed to work fine. After that, I updated my watson-developer-cloud library to the latest version (it was at 1.4.1 and I updated it to 1.7.0) and retried the documents from Node.js, but I still get the same errors as before. – David Powell May 14 '16 at 19:03
  • The code that I'm using to call the document conversion service is below. The PDF to be converted is loaded into the variable "content": document_conversion.convert({ file: new Buffer(content), conversion_target: "ANSWER_UNITS", content_type:'application/pdf' }, function (err, response){ if (err) {... (Sorry it doesn't format well in comments) – David Powell May 14 '16 at 19:03