Streaming PDF files using Spark Streaming fileStream

Asked Jul 31 '16 at 11:38

Active Aug 01 '16 at 01:23


I am building an application that scans PDF files and extracts data from them.

I have already built an application that does batch processing using Spark Core, but now I want the data to be streamed continuously from a directory.

How can I use Spark Streaming's fileStream method to read PDF files from a directory?

And should this directory be an HDFS directory?

Thanks in advance.

edited Jul 31 '16 at 12:11

Praveen Kumar K S


asked Jul 31 '16 at 11:38

fady zohdy

Please post sample code for the options you have tried. This will help us understand the APIs you are currently using and advise you accordingly. – Praveen Kumar K S Jul 31 '16 at 11:56
@PraveenKumar I am using Spark 1.6.2. I don't really see the need to post my sample code because it is irrelevant to what I am asking about. – fady zohdy Jul 31 '16 at 18:17
@fadyzohdy, did you find a solution for this? If so, could you please share the idea? I have a similar requirement, but for ORC files. – Sudheer Palyam Apr 24 '17 at 11:19


0 Answers