
The Scenario

I am running a Vue.js client, a Node.js Restify API server, and a Tika server from the official Docker image. A user makes a POST call with formData containing a PDF file to be parsed. The API server receives the POST call and saves the PDF on the server. The API server should then PUT the file to the unpack/all endpoint on the Tika server and receive a zip containing a text file, a metadata file, and the set of images in the PDF. I would then process the zip and pass some data back to the client.

The Problem

I load the file to be parsed either as a stream, using let parsingData = fs.createReadStream(requestFilename);, or as a buffer, using let parsingData = fs.readFileSync(requestFilename);. I set the axios data field to parsingData, then make my request. When I get the response from the Tika server, it seems the server has treated the request body as empty: within the zip there are no images, the TEXT file is empty, and the METADATA file does not reflect the PDF either.

When I make the equivalent request to the Tika server with curl, the response zip contains accurate text, metadata, and the extracted images:

curl -T pdf_w_images_and_text.pdf http://localhost:9998/unpack/all -H "X-Tika-PDFExtractInlineImages: true" -H "X-Tika-PDFExtractUniqueInlineImagesOnly: true" > tika-response.zip

The Code

let parsingData = fs.createReadStream('pdf_w_images_and_text.pdf');

axios({
    method: 'PUT',
    url: 'http://localhost:9998/unpack/all',
    data: parsingData,
    responseType: 'arraybuffer',
    headers: {
        'X-Tika-PDFExtractInlineImages': 'true',
        'X-Tika-PDFExtractUniqueInlineImagesOnly': 'true'
    },
})
.then((response) => {
    console.log('Tika-server response received');
    const outputFilename = __dirname+'\\output.zip';
    console.log('Attempting to convert Tika-server response data to ' + outputFilename);
    fs.writeFileSync(outputFilename, response.data);
    if (fs.existsSync(outputFilename)) {
        console.log('Tika-server response data saved at ' + outputFilename);
    }
})
.catch(function (error) {
    console.error(error);
});

The Question

How do I encode and attach my file to my PUT request in NodeJs such that the Tika-server treats it as it does when I make the request through CURL?

Dent7777

1 Answer


Axios is sending the request with a Content-Type of application/x-www-form-urlencoded, so the Tika server does not detect the body as a PDF and the file content is never parsed.

You can change this by setting the Content-Type header yourself: pass either the known content type of the file (application/pdf for a PDF) or application/octet-stream, which lets Apache Tika Server auto-detect the type.
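As a rough sketch of choosing between those two options, the helpers below pick a Content-Type from the file extension and build the axios request config around it. The names contentTypeFor and buildUnpackRequest are hypothetical, introduced only for illustration; the URL and the X-Tika headers are the ones from the question.

```javascript
const path = require('path');

// Hypothetical helper: use the known type for PDFs, otherwise fall back to
// application/octet-stream so the Tika server can auto-detect the type.
function contentTypeFor(filename) {
    return path.extname(filename).toLowerCase() === '.pdf'
        ? 'application/pdf'
        : 'application/octet-stream';
}

// Hypothetical helper: build the config object to pass to axios(...).
// `data` can be a stream from fs.createReadStream or a buffer from
// fs.readFileSync, as in the question.
function buildUnpackRequest(filename, data) {
    return {
        method: 'PUT',
        url: 'http://localhost:9998/unpack/all',
        data: data,
        responseType: 'arraybuffer',
        headers: {
            'X-Tika-PDFExtractInlineImages': 'true',
            'X-Tika-PDFExtractUniqueInlineImagesOnly': 'true',
            // Explicit Content-Type so axios does not apply its own default.
            'Content-Type': contentTypeFor(filename)
        },
    };
}
```

The config would then be used as axios(buildUnpackRequest('test.pdf', fs.createReadStream('test.pdf'))), with the same .then and .catch handlers as in the question.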

Below is a sample based on your question's code that illustrates this:

#!/usr/bin/env node

const fs = require('fs')
const axios = require('axios')

let parsingData = fs.createReadStream('test.pdf');

axios({
    method: 'PUT',
    url: 'http://localhost:9998/unpack/all',
    data: parsingData,
    responseType: 'arraybuffer',
    headers: {
        'X-Tika-PDFExtractInlineImages': 'true',
        'X-Tika-PDFExtractUniqueInlineImagesOnly': 'true',
        'Content-Type': 'application/octet-stream'
    },
})
.then((response) => {
    console.log('Tika-server response received');
    const outputFilename = __dirname+'/output.zip';
    console.log('Attempting to convert Tika-server response data to ' + outputFilename);
    fs.writeFileSync(outputFilename, response.data);
    if (fs.existsSync(outputFilename)) {
        console.log('Tika-server response data saved at ' + outputFilename);
    }
})
.catch(function (error) {
    console.error(error);
});
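If you want to confirm which Content-Type your axios version actually sends, one option is to point the request at a small local server that echoes the header back. This is only a diagnostic sketch using Node's built-in http module; summarizeRequest and createHeaderEchoServer are hypothetical names, and nothing here is part of Tika or axios.

```javascript
const http = require('http');

// Hypothetical helper: reduce an incoming request to the fields of interest.
function summarizeRequest(req) {
    return {
        method: req.method,
        contentType: req.headers['content-type'] || '(none)'
    };
}

// Hypothetical helper: a server that discards the request body and replies
// with a JSON summary of the method and Content-Type it received.
function createHeaderEchoServer() {
    return http.createServer((req, res) => {
        req.resume(); // drain the uploaded body without storing it
        req.on('end', () => {
            res.setHeader('Content-Type', 'application/json');
            res.end(JSON.stringify(summarizeRequest(req)));
        });
    });
}
```

For example, after createHeaderEchoServer().listen(9999) (the port is an arbitrary choice), changing the axios url to http://localhost:9999/ and logging response.data.toString() should show the Content-Type that was sent, with and without the explicit header.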
Dave Meikle