In Spark Streaming, I want to use fileStream to monitor a directory. But the files in that directory are compressed using LZ4, so the new .lz4 files are not detected by the following code. How can I detect these new files?

val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString)

I know the textFile function can read .lz4 data, but I'm using Spark Streaming with the fileStream function...
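One thing worth checking (a sketch under assumptions, not a confirmed fix): Hadoop's `TextInputFormat` only decompresses input transparently when a codec matching the file extension is registered via `io.compression.codecs`. The sketch below registers Hadoop's `Lz4Codec` through the Spark conf before building the stream; the application name, directory path, and batch interval are placeholders standing in for `gc.input_dir` and the rest of your setup.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("lz4-filestream")  // placeholder app name
  // Register the LZ4 codec so TextInputFormat maps the .lz4 extension to it.
  .set("spark.hadoop.io.compression.codecs",
       "org.apache.hadoop.io.compress.DefaultCodec," +
       "org.apache.hadoop.io.compress.Lz4Codec")

val ssc = new StreamingContext(conf, Seconds(30))  // placeholder batch interval

// The filter accepts every path, so .lz4 files are not excluded here;
// newFilesOnly = false also processes files already present in the directory.
val list_join_action_stream = ssc
  .fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///path/to/input",        // placeholder for gc.input_dir
    (t: Path) => true,
    false)
  .map(_._2.toString)
```

One caveat: Hadoop's `Lz4Codec` uses its own block framing, so files produced by the standalone `lz4` command-line tool may not decompress with it even once the codec is registered; files would need to be written with a Hadoop-compatible LZ4 writer.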

user2848932
  • Are the files in the input directory named with the `.lz4` extensions? – vanekjar May 13 '15 at 17:22
  • possible duplicate of [Decompressing LZ4 compressed data in Spark](http://stackoverflow.com/questions/24985704/decompressing-lz4-compressed-data-in-spark) – vanekjar May 13 '15 at 17:23
  • Yes, the files in the input dir are named with the .lz4 extension. – user2848932 May 14 '15 at 02:51
  • @vanekjar I'm using fileStream in Spark Streaming; the question you linked uses textFile. – user2848932 May 14 '15 at 02:54
  • Spark uses Hadoop input format for reading files. So `.textFile` and `.fileStream` with `TextInputFormat` should be the same. Hadoop should handle the input compression transparently. What is your Hadoop version? – vanekjar May 14 '15 at 09:37
  • @vanekjar My Hadoop version is 2.6, and my experiment shows they behave differently... – user2848932 May 14 '15 at 12:26
  • @user2848932, did you find solution to spark stream .lz4 files? If so, can you please share the details. I'm having the similar streaming challenge, but with .ORC files. – Sudheer Palyam Apr 25 '17 at 06:04

0 Answers