pig - walk hdfs directory and tika-parse documents into hive?

Question

What is the best way to walk a directory structure in HDFS? Is there any way to do this in Pig?

I ask because I have an HDFS directory tree with multiple subdirectories and many different document types, such as xls, doc, docx, html, rtf, etc.

I'd like to process these binary/rich-text documents and extract their text, to eventually end up in a Hive output record. I'm looking at Apache Tika to do this, and I have written a simple Java command-line program that seems to do it without issue. I'm planning on turning this command-line program into either a Hive or Pig UDF to be called on each file of interest for text extraction. However, the last piece of the puzzle is walking the actual directory structure.

I've googled for "Pig walk directory", "Cascading walk directory" (although it seems Cascalog doesn't support my version of Hadoop), etc.

At this point, unless I come across a better option, I'll run hadoop fs -ls -R /all_documents and load that listing into a table in Hive, then process each row via the UDF after the fact.
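
To make that fallback concrete, here is a rough sketch of turning the recursive listing into one file path per row before loading it into Hive. It assumes the usual eight-column output of hadoop fs -ls -R (permissions, replication, owner, group, size, date, time, path); the class name LsRParser and the method filePaths are made up for illustration, and the column layout should be checked against the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts regular-file paths from `hadoop fs -ls -R` output.
class LsRParser {
    // Assumed column layout: permissions, replication, owner, group, size, date, time, path.
    private static final int COLUMNS = 8;

    // Returns the paths of regular files; directories and unparseable lines are skipped.
    static List<String> filePaths(List<String> lsLines) {
        List<String> paths = new ArrayList<>();
        for (String line : lsLines) {
            String trimmed = line.trim();
            // Regular files start with '-'; directory entries start with 'd',
            // and any other line (such as a header) is ignored as well.
            if (!trimmed.startsWith("-")) {
                continue;
            }
            // Limit the split so a path containing spaces stays in one piece.
            String[] cols = trimmed.split("\\s+", COLUMNS);
            if (cols.length == COLUMNS) {
                paths.add(cols[COLUMNS - 1]);
            }
        }
        return paths;
    }
}
```

Each returned path could then be written as one line of a text file backing a single-column Hive table, so that the UDF is called once per row.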

It seems there should be a more elegant way to walk a directory tree, though?

0 Answers