Pig - load Word documents (.doc & .docx) with pig

Question

I can't load Microsoft Word documents (.doc or .docx) with Pig. Whether I use TextLoader(), PigStorage(), or no loader at all, it doesn't work: the output is just garbled symbols.

I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to write one at the moment.

I would like to put all the content of a .doc file into a single chararray bag so that I can later use a filter function to process it.

How can I do this?

Thanks

score 1 · Answer 1 · edited May 23 '17 at 11:57

What you heard is right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write a UDF that loads the files directly into Pig, or preprocess the files by converting every .doc and .docx file to .txt so that Pig loads the .txt files instead. This link may help you get started in finding a way to convert the files.
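
For the preprocessing route, here is a minimal sketch of one way to turn a .docx file into plain text using only the Java standard library (Java 9 or later). It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml, and it simply concatenates the text runs of each paragraph. The class name DocxToText and the method extractText are hypothetical names chosen for illustration, not part of Pig. The sketch ignores tabs, line breaks, headers, and footnotes, and it does not handle legacy .doc files, which use a different container format and would need a library such as Apache POI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Pig): converts a .docx file to plain text.
public class DocxToText {

    // XML namespace of WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the document text, one line per paragraph. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // The main document body lives in this archive entry.
                if ("word/document.xml".equals(entry.getName())) {
                    return parseBody(zip.readAllBytes()); // readAllBytes needs Java 9+
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    private static String parseBody(byte[] xmlBytes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document xml = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate every text run (w:t) inside this paragraph (w:p).
            NodeList runs = ((Element) paragraphs.item(i))
                    .getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java DocxToText <input.docx> <output.txt>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] out = extractText(in).getBytes(StandardCharsets.UTF_8);
            Files.write(Paths.get(args[1]), out);
        }
    }
}
```

You would run it once per file, for example `java DocxToText report.docx report.txt` (file names are placeholders), and then point Pig at the resulting .txt files with TextLoader().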

However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.

Update: Here are a couple of resources I've used for writing my Java (Load) UDFs in the past. One, Two.

answered Aug 29 '13 at 17:01

mr2ert

Thanks for the answer. Do you know where I could find a good, simple tutorial for writing the UDF? – shanks_roux Aug 30 '13 at 14:12
@shanks_roux I've added some resources. They don't explicitly walk you through the process, but you should be able to patch something together from them. – mr2ert Aug 30 '13 at 15:36