
Can DStreams with new names be created, and older DStreams destroyed, at runtime?

# Read the input DStream
inputDstream = ssc.textFileStream("./myPath/")

Example: I am reading a file called cvd_filter.txt in which every line contains a string that serves as a filter criterion for the DStream. This file gets updated (and may be appended to) with new values:

Example: At time 10:00; cat cvd_filter.txt

"1001" "1002" "1003"

# Read cvd_filter.txt every 5 mins and create/destroy DStreams.

with open("cvd_filter.txt") as f:
    content = f.readlines()
    dstream_content[0] = inputDstream.filter(lambda a: content[0] in a)

// At this point (dstream_1001 , dstream_1002, dstream_1003) should get created. 
// NOW, DO SOME OPERATION ON INDIVIDUAL dstreams. 

At time 10:05; cat cvd_filter.txt

"1004" "1002" "1003"

// Create dstream_1004 for new filter string, Destroy dstream_1001 only 
// but retain dstream_1002 and dstream_1003.  
// At this point (dstream_1004, dstream_1002, dstream_1003) should be present.
// NOW, DO SOME OPERATION ON INDIVIDUAL dstreams.
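The create/destroy bookkeeping between the two snapshots above can be sketched as plain set arithmetic. This is a minimal, Spark-free sketch; `parse_filters` and the snapshot strings are illustrative, not part of any Spark API:

```python
# Parse one snapshot of cvd_filter.txt into a set of filter IDs.
# Format assumed from the example above: quoted, whitespace-separated IDs.
def parse_filters(text):
    return {token.strip('"') for token in text.split()}

old_ids = parse_filters('"1001" "1002" "1003"')  # 10:00 snapshot
new_ids = parse_filters('"1004" "1002" "1003"')  # 10:05 snapshot

to_create  = new_ids - old_ids   # streams to create
to_destroy = old_ids - new_ids   # streams to drop
to_retain  = old_ids & new_ids   # streams to keep
```

Computing these three sets every 5 minutes is the easy part; the hard part, addressed in the answer below, is that Spark Streaming does not allow attaching new DStreams to a running context.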
vkb
  • A somewhat similar question: [link](http://stackoverflow.com/questions/34897236/spark-get-multiple-dstream-out-of-a-single-dstream) However, I do not want to create individual jobs as suggested in Option 1, since that would require reading the same streaming file multiple times. – vkb Jun 13 '16 at 22:57

1 Answer


No. No new streams or operations on DStreams can be added to a running StreamingContext. I'd suggest modeling your use case in terms of foreachRDD, which gives you the freedom to perform arbitrary operations on the underlying RDDs. For example:

val dstream = ??? // original dstream
dstream.foreachRDD { rdd =>
  val filters = ??? // read the filter file
  val filteredRDDs = filters.map(f => rdd.filter(elem => elem.contains(f)))
  // ...
}

Then express the operations you need on the different filtered RDDs. DStreams delegate all transformation operations to their underlying RDDs, so you should be able to express your business logic this way.
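The per-batch re-read that foreachRDD enables can be simulated without Spark: treat each micro-batch as a plain list and re-derive one filtered collection per filter string on every batch. `process_batch` and the sample records are hypothetical; the dictionary comprehension mirrors `filters.map(f => rdd.filter(elem => elem.contains(f)))` from the Scala snippet:

```python
# One entry per filter string, mimicking a filtered RDD per filter.
def process_batch(batch, filters):
    return {f: [rec for rec in batch if f in rec] for f in filters}

batch = ["1001,a", "1002,b", "1001,c", "1003,d"]
per_filter = process_batch(batch, ["1001", "1002"])
```

Because the filter list is read inside the per-batch function, adding or dropping a filter string takes effect on the next batch without touching the DStream graph, which is the whole point of this pattern.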

maasg
  • This one looks good if I had only one operation to perform on the filteredRDDs. However, my business logic needs to create windows of different lengths on different filteredRDDs and maintain those windows until the strings in cvd_filter.txt change. For example, dstream_1001 should be maintained for 5 mins, until it disappears from cvd_filter.txt at 10:05, whereas dstream_1002 and dstream_1003 should maintain windows of much greater length. If my original dstream gets a new file every minute, will the windowed dstream_100* persist the previous timestamps' elements? – vkb Jun 14 '16 at 14:52
  • @VinayKumar Could you explain what you want to achieve? When you say "window", is it with the intention of preserving those records for a limited period of time? How are you going to decide the window size for each key? Is that also dynamically loaded from the file? – maasg Jun 14 '16 at 15:02
  • Hi maasg, actually cvd_filter.txt itself has other fields, such as the window size, besides the filter string "1001", which happens to be the ID in my application. So each ID has its own window size. Overall, I am trying to read the ID/window-size pairs dynamically from cvd_filter.txt and maintain a window over that ID's elements from the original dstream. – vkb Jun 14 '16 at 17:31
  • @VinayKumar Why do you need to maintain a window? What happens with the data afterwards? – maasg Jun 14 '16 at 17:33
  • The data in a particular ID's window will be tested for being greater than, less than, or equal to some particular value, and a flag will be sent if the condition is true. So I will have a bunch of IDs with their windows, and each ID's window will be tested against a particular inequality. – vkb Jun 14 '16 at 17:38
  • A window is needed because I need to see whether the data is continuously greater than, say, 150 for 4 mins for ID1, less than 20 for 3 mins for ID2, and so on. The values 150, 20, etc. are also part of cvd_filter.txt, so each line looks like: ID-WindowSize-InequalityValue-InequalityType. – vkb Jun 14 '16 at 17:39
  • @VinayKumar Sounds like a use case for `mapWithState` instead of attempting to handle each filter entry as an individual data stream. – maasg Jun 14 '16 at 18:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/114674/discussion-between-vkb-and-maasg). – vkb Jun 14 '16 at 20:10
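The `mapWithState` suggestion boils down to keeping per-ID state (the recent values for that ID) and testing it against that ID's rule on each update. A Spark-free sketch of the windowed inequality check discussed in the comments follows; the rule format ID-WindowSize-InequalityValue-InequalityType is taken from the thread above, while `check_window` and the `OPS` mapping are illustrative, not a Spark API:

```python
import operator

# Map the hypothetical InequalityType field to a comparison function.
OPS = {"gt": operator.gt, "lt": operator.lt, "eq": operator.eq}

def check_window(values, window_size, threshold, op):
    """Return True only when the last `window_size` readings for an ID
    all satisfy the inequality, e.g. 'greater than 150 for 4 readings'."""
    window = values[-window_size:]
    return len(window) == window_size and all(OPS[op](v, threshold) for v in window)
```

With per-minute batches, the state carried per ID would be its recent values, and the state update function would append the new reading, trim to the window size, and run a check like this to decide whether to emit a flag.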