I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/) but anything for Node?
9 Answers
-
1Note: textract uses catdoc for `.doc` files, and doesn't work in windows. – Tracker1 Dec 29 '14 at 22:23
-
1node-office is not in active development (npm says end of life), hwile textract is being actively developed as of Sept 2016. – steampowered Sep 11 '16 at 20:20
-
@NayanPatel this is because of some recent updates. I have a project that depends on that behavior, so I've pinned my version. – James_1x0 Sep 16 '19 at 14:58
-
@James_1x0 what is the version you have pinned the version. – Nayan Patel Sep 17 '19 at 05:01
Looks like there's a few for pdf, but I didn't find any for Word.
CPU bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefits using node to do it over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.
I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/
While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec

- 2,109
- 5
- 30
- 53

- 5,156
- 3
- 36
- 40
-
The advantage of a pure JS solution is that is it portable between the browser and Node – sdgfsdh Jan 16 '18 at 16:51
You can easily convert one into another, or use for example a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.
This can be done using the services of Livedocx for example
To use this service from node, see node-livedocx (Disclaimer: I am the author of this node module)

- 833
- 11
- 11
-
3Not relevant, the question was how to read PDF/DOC files not how to generate one. – Elhay Avichzer Dec 25 '19 at 21:26
I would suggest looking into unoconv for your initial conversion, this uses LibreOffice or OpenOffice for the actual conversion. Which adds some overhead.
I'd setup a few workers with all the necessities setup, and use a request/response queue for handling the conversion... (may want to look into kue or zmq)
In general this is a CPU bound and heavy task that should be offloaded... Pandoc and others specifically mention .docx
, not .doc
so they may or may not be options as well.
Note: I know this question is old, just wanted to provide a current answer for others coming across this.

- 19,103
- 12
- 80
- 106
you can use pdf-text for pdf files. it will extract text from a pdf into an array of text 'chunks'. Useful for doing fuzzy parsing on structured pdf text.
var pdfText = require('pdf-text')
var pathToPdf = __dirname + "/info.pdf"
pdfText(pathToPdf, function(err, chunks) {
//chunks is an array of strings
//loosely corresponding to text objects within the pdf
//for a more concrete example, view the test file in this repo
})
var fs = require('fs')
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
console.log(chunks)
})
for docx files you can use mammoth, it will extract text from .docx files.
var mammoth = require("mammoth");
mammoth.extractRawText({path: "./doc.docx"})
.then(function(result){
var text = result.value; // The raw text
console.log(text);
var messages = result.messages;
})
.done();
I hope this will help.

- 1,023
- 1
- 15
- 24
For parsing pdf files you can use pdf2json node module
It allows you to convert pdf file to json as well as to raw text data.

- 83,883
- 25
- 248
- 179
Another good option if you only need to convert from Word documents is Mammoth.js.
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.

- 21,381
- 38
- 125
- 225
Here is an example showing how to download and extract text from a PDF using PDF.js:
import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';
const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
const main = async () => {
const response = await superagent.get(url).buffer();
const data = response.body;
const doc = await pdf.getDocument({ data });
for (const i of _.range(doc.numPages)) {
const page = await doc.getPage(i + 1);
const content = await page.getTextContent();
for (const { str } of content.items) {
console.log(str);
}
}
};
main().catch(error => console.error(error));

- 33,689
- 26
- 132
- 245
You can use Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX,Open Office, and PDF. It's paid API but the free plan provides 150 free monthly API calls.
P.S: I'm developer evangelist at Aspose.
const { WordsApi, ConvertDocumentRequest } = require("asposewordscloud");
const fs = require('fs');
// Get Customer ID and Customer Key from https://dashboard.aspose.cloud/
wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxx");
const request = new ConvertDocumentRequest({
format: "txt",
document: fs.createReadStream("C:/Temp/02_pages.pdf"),
});
const outputFile = "C:/Temp/ConvertPDFtotxt.txt";
wordsApi.convertDocument(request).then((result) => {
console.log(result.response.statusCode);
console.log(result.body.byteLength);
fs.writeFileSync(outputFile, result.body);
}).catch(function(err) {
// Deal with an error
console.log(err);
});

- 33
- 6

- 940
- 5
- 9