
I am building an application that scans PDF files and extracts data from them.

I have already built an application that does batch processing using Spark Core, but now I want the data to be streamed continuously from a directory.

How can I use the Spark Streaming fileStream method to read PDF files from a directory?

Also, does this directory have to be an HDFS directory?

Thanks in advance.

Praveen Kumar K S
fady zohdy
  • Please post sample code for the options you have tried. It would help us understand which APIs you are currently using and advise you accordingly. – Praveen Kumar K S Jul 31 '16 at 11:56
  • @PraveenKumar I am using Spark 1.6.2. I don't really see the need to post my sample code because it is irrelevant to what I am asking about. – fady zohdy Jul 31 '16 at 18:17
  • @fadyzohdy, did you find a solution for this? If so, can you please share the idea? I have a similar requirement, but for ORC files. – Sudheer Palyam Apr 24 '17 at 11:19

0 Answers