How to run Hadoop wordcount program on pdf and doc files?

Question

How can I run the Hadoop wordcount program on PDF and DOC files? When I run it on PDF files, the output contains weird characters.

This post may help you to get further: http://stackoverflow.com/a/9298965 — Lorand Bendig, Mar 09 '13 at 09:03

score 2 · Answer 1 · answered Mar 08 '13 at 20:43

The file formats you mentioned are binary and not suitable as input to word count without pre-processing them into plain text. You will first have to convert them using some other tool/library into a plain text format.

There are probably some free command-line utilities out there which can help you do this.
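The "convert first, then count" idea can be sketched in plain Java as below. This is only an illustration under assumptions: `TextExtractor` and `ExtractThenCount` are hypothetical names, not part of Hadoop, and the `PLAIN_TEXT` extractor is a stand-in that only works for files that are already UTF-8 text. For real PDF or DOC input you would plug in a parser from a library such as Apache PDFBox, Apache POI, or Apache Tika at that point.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical hook: turns the raw bytes of one document into plain text. */
interface TextExtractor {
    String extract(byte[] rawDocument);
}

/** Hypothetical helper showing "extract text first, then count words". */
final class ExtractThenCount {

    /** Stand-in extractor for input that is already UTF-8 text; PDF/DOC need a real parser here. */
    static final TextExtractor PLAIN_TEXT =
            raw -> new String(raw, StandardCharsets.UTF_8);

    /** Counts whitespace-separated tokens in the text produced by the extractor. */
    static Map<String, Integer> countWords(byte[] rawDocument, TextExtractor extractor) {
        String text = extractor.extract(rawDocument);
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokens = new StringTokenizer(text); // splits on whitespace
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `ExtractThenCount.countWords(bytes, ExtractThenCount.PLAIN_TEXT)` on the bytes of a text file returns a word-to-count map. Feeding raw PDF bytes through the same stand-in extractor would reproduce the weird characters from the question, which is why the extraction step has to understand the file format.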

Javanator

The statement that binary file formats are not suitable as input and must be converted to plain text is completely wrong. The most efficient Hadoop programs use binary input, because it avoids the need to parse the input and thus increases efficiency. – Charles Menguy Mar 09 '13 at 20:18

score 2 · Answer 2 · answered Mar 09 '13 at 20:17

Hadoop is not limited to processing plain-text files; it can of course process binary files as well. SequenceFiles, for example, are the most common binary format in Hadoop. If you need a custom binary format, you can support it by implementing your own InputFormat and RecordReader.

I would recommend looking at this great article on processing .doc files in Hadoop, and this one on processing .docx and .pdf files, which should fit your needs.
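To give a rough idea of what such a RecordReader does for a format like PDF or DOC, here is a plain-Java model, assuming each file is handed to the mapper as one record (key = file name, value = all of its bytes). `WholeFileReaderModel` is a hypothetical class and deliberately does not use the Hadoop API; in a real job the same logic would sit in a subclass of Hadoop's `RecordReader`, paired with an `InputFormat` that does not split its files. It is not taken from the linked articles.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Plain-Java model of a whole-file record reader (hypothetical, not Hadoop API). */
final class WholeFileReaderModel {
    private final String fileName;
    private final InputStream in;
    private boolean processed = false;
    private byte[] currentValue;

    WholeFileReaderModel(String fileName, InputStream in) {
        this.fileName = fileName;
        this.in = in;
    }

    /** Advances to the next record; returns false once the single record was consumed. */
    boolean nextKeyValue() {
        if (processed) {
            return false;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            // Binary-safe copy: no line splitting and no charset decoding.
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        currentValue = buffer.toByteArray();
        processed = true;
        return true;
    }

    /** Key of the current record: the file name. */
    String getCurrentKey() {
        return fileName;
    }

    /** Value of the current record: the complete, unmodified file content. */
    byte[] getCurrentValue() {
        return currentValue;
    }
}
```

The point of the design is that a document parser needs the complete, unmodified byte stream of a file, so the reader must not cut the input into lines or splits. The mapper would then run the format-specific text extraction on the value and emit one `(word, 1)` pair per token.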
