I'm trying to send a PDF for content extraction to a Tika Server but always get the error: "Cannot convert text from stream using the source encoding"

This is how Tika is expecting the files:

"All services that take files use HTTP "PUT" requests. When "PUT" is used, the original file must be sent in request body without any additional encoding (do not use multipart/form-data or other containers)." Source: https://wiki.apache.org/tika/TikaJAXRS#Services
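
For illustration, the requirement quoted above amounts to a request of roughly the following shape. This is only a sketch, not Wakanda-specific code: `buildTikaRequest` is a hypothetical helper name, the URL is the one used in the code below, and the `Content-Type` header is an assumption added as a hint for Tika.

```javascript
// Sketch only: describes the raw PUT that the Tika documentation asks for.
// TIKA_URL comes from the question; buildTikaRequest is a hypothetical helper.
const TIKA_URL = "http://localhost:9998/tika";

function buildTikaRequest(pdfBytes) {
  return {
    method: "PUT", // Tika services that take files expect PUT
    headers: {
      "Content-Type": "application/pdf", // assumption: type hint for Tika
      "Accept": "text/plain",            // ask for plain-text extraction
    },
    body: pdfBytes, // the raw file bytes, no multipart/form-data wrapper
  };
}

// Usage, assuming a runtime with a global fetch (e.g. Node.js 18+):
//   const fs = require("fs");
//   const pdf = fs.readFileSync("pdf.pdf");
//   fetch(TIKA_URL, buildTikaRequest(pdf))
//     .then((res) => res.text())
//     .then(console.log);
```

The point is that the body is the file itself; nothing is wrapped around it.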

What is the correct way of sending the file with XMLHttpRequest()?

Code:

var response, error, file, blob, xhr, url;

file = new File("/PROJECT/web/dateien/ai/pdf.pdf");

blob = file.toBuffer().toBlob("application/pdf");
url = "http://localhost:9998/tika";

// send data
try {
    xhr = new XMLHttpRequest();
    xhr.open("PUT", url);
    xhr.setRequestHeader("Accept", "text/plain");
    xhr.send(blob);
} catch (e) {
    error = e;
}

({
    response: xhr.responseText,
    status: xhr.statusText,
    error: error,
    type: xhr.responseType,
    blob: blob
});

Error:

[Screenshot: output result/error]

Stefan

1 Answer

I suspect Wakanda converts the PUT request into a POST request when the XHR body is a blob. Can you capture the request with Wireshark and add the details? If that is the case, you can file an issue in the Wakanda issue tracker (https://github.com/Wakanda/wakanda-issues/issues).
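
If Wireshark is not at hand, a small local server that prints the request line would show the same thing. A minimal sketch, assuming Node.js is available on the machine; `formatRequestLine`, `startEchoServer`, and port 9999 are hypothetical choices for illustration, not part of Wakanda or Tika.

```javascript
const http = require("http");

// Rebuild the HTTP request line, e.g. "PUT /tika HTTP/1.1".
function formatRequestLine(req) {
  return `${req.method} ${req.url} HTTP/${req.httpVersion}`;
}

// Start a server that logs the method, path, and body size of each request.
function startEchoServer(port) {
  const server = http.createServer((req, res) => {
    let size = 0;
    req.on("data", (chunk) => { size += chunk.length; });
    req.on("end", () => {
      console.log(formatRequestLine(req), `(${size} body bytes)`);
      console.log("content-type:", req.headers["content-type"]);
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(formatRequestLine(req) + "\n");
    });
  });
  return server.listen(port);
}

// Usage: call startEchoServer(9999), then point the XHR at
// http://localhost:9999/tika instead of the Tika server.
```

If the logged line starts with POST although the code called `xhr.open("PUT", url)`, the method is being changed before the request leaves Wakanda.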

Hope it helps, Yann

Yann
    Looks like you're right ... Wakanda is doing a POST request instead of a PUT (POST /tika HTTP/1.1). I will file an issue. – Stefan Jul 11 '16 at 08:51