I have installed Hadoop and Hive. I can process and query XLS and TSV files using Hive. I want to process other file types such as DOCX, PDF, and PPT. How can I do this? Is there a separate procedure for processing these files on AWS? Please help me.
1 Answer
Consuming those files on AWS is no different from consuming them on any other Hadoop platform. For easy access and durable storage, you can put the files in S3.

Naveen Vijay
- Thanks for your reply. I want to know how I can run queries over DOCX, PDF, and PPT files. – Mahmudul Hasan Mar 29 '15 at 05:46
- I believe there are open source APIs for extracting data from all of the file types mentioned above. You would use one of them in conjunction with Hadoop / EMR. – Naveen Vijay Apr 01 '15 at 02:13
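
The thread does not name a specific extraction library, so the following is only a minimal sketch of the "extract first, then query" idea, using nothing but Python's standard library and covering .docx alone. It relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The helper names `docx_to_text` and `to_tsv_row` are hypothetical, and PDF or PPT files would need a dedicated extraction library (Apache Tika is one open source option) rather than this approach.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file given its raw bytes.

    Hypothetical helper: reads only the main document part and ignores
    headers, footers, footnotes, and comments.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def to_tsv_row(filename: str, text: str) -> str:
    """Collapse all whitespace so the text fits on one tab-separated row."""
    flat = " ".join(text.split())
    return f"{filename}\t{flat}"
```

Each row produced this way has two tab-separated columns (file name and extracted text), so the output could be written to HDFS or S3 and queried through a Hive table defined over tab-delimited text, the same way the TSV files in the question are queried.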