
I have PST and other email files in HDFS. I want to run text analysis on them using whichever Hadoop component suits the task best. How do I get started?

Do I have to first extract the actual content out of these files and store it somewhere (in a text file for example) and then run the analysis on the text file?
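To make the two-step idea concrete, here is a minimal sketch of what I have in mind. It assumes the messages are already available as plain RFC 822 text (for example `.eml`), which is not the case for `.pst` or `.msg`; those would need a separate extraction tool first. It uses only the Python standard library, and the function names `extract_text` and `word_counts` are hypothetical, chosen just for illustration.

```python
from collections import Counter
from email import message_from_string, policy
import re


def extract_text(raw_message: str) -> str:
    """Step 1: pull the plain-text body out of an RFC 822 message."""
    msg = message_from_string(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""


def word_counts(text: str) -> Counter:
    """Step 2: a trivial stand-in for 'analysis': lowercase word frequencies."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample message; real input would be read from HDFS.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hadoop stores the files. Hadoop also runs the analysis.\n"
)

text = extract_text(sample)
counts = word_counts(text)
print(counts.most_common(3))
```

In a real job the extracted text would presumably be written back to HDFS and the counting step replaced by whatever analysis is needed.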

Please advise.

P.S. I came across this while searching on Google. Is it the only option, or are other solutions available?

natarajan k
  • What did you try? How far did you get? What errors/problems did you face? – Gagravarr Jul 04 '15 at 21:07
  • I created sample files from Outlook email in several formats (.pst, .oft, .msg, .txt, .mht, .htm) and loaded them as-is into HDFS. Now I want to extract the contents of these files and analyse them. Do we need to use Apache Tika to extract the contents, or is there a component that can extract and analyse the data directly? – natarajan k Jul 09 '15 at 06:23
  • Can I use Spark MLlib, which I understand uses Tika internally, to extract the contents? Is that right? – natarajan k Jul 09 '15 at 07:11

0 Answers