
I have a structured streaming DataFrame, tempDataFrame2, containing a column Field1. I am trying to calculate the approximate quantiles of Field1. However, whenever I call

val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)

I get the following error message:

Queries with streaming sources must be executed with writeStream.start()

Below is the code snippet:

val tempDataFrame2 = A structured streaming dataframe

// Calculate IQR
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)

// Filter messages
val tempDataFrame3 = tempDataFrame2.filter("Some working filter")

val query = tempDataFrame2.writeStream.outputMode("append").queryName("table").format("console").start()
query.awaitTermination()
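One way to work around this restriction (a sketch only, not a definitive fix; the column name Field1 comes from the question, and the outlier filter is an assumed example of "Some working filter") is to move the approxQuantile call into a foreachBatch sink (available since Spark 2.4). Inside foreachBatch, each micro-batch is handed to you as a plain, non-streaming DataFrame, so actions like approxQuantile are allowed:

```scala
import org.apache.spark.sql.DataFrame

// Sketch: compute quantiles per micro-batch via foreachBatch.
// Caveat: each call sees only that micro-batch, not the whole stream.
val query = tempDataFrame2.writeStream
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) { // approxQuantile on an empty batch returns an empty array
      val Array(q1, q3) = batchDF.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
      val iqr = q3 - q1
      // Example filter (an assumption): drop rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
      val filtered = batchDF.filter(
        batchDF("Field1") >= q1 - 1.5 * iqr && batchDF("Field1") <= q3 + 1.5 * iqr)
      filtered.show() // or write to your real sink
    }
  }
  .start()
query.awaitTermination()
```

Note the caveat in the comments: the quantiles are per micro-batch, which may or may not be acceptable depending on whether a global IQR over all records is required.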

I have already gone through these two links from SO: Link1 Link2. Unfortunately, I am not able to relate those answers to my problem.

Edit

After reading the comments, following is the way I am planning to go ahead with:

1) Read all the uncommitted offsets from the Kafka topic.
2) Save them to a DataFrame variable.
3) Stop the structured streaming query so that I don't read from the Kafka topic anymore.
4) Start processing the saved DataFrame from step 2).

But now I am not sure how to proceed:

1) How do I know that there are no more records to consume in the Kafka topic, so that I can stop the streaming query?
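If the data is actually bounded, the steps above can be collapsed into a single batch read of the topic (a sketch; the bootstrap servers, topic name, and value parsing below are placeholders you would replace). The Kafka source also works with spark.read: a batch query reads everything between startingOffsets and endingOffsets and then finishes, so there is no need to detect "no more records" yourself, and no streaming query to stop:

```scala
// Sketch: one-shot batch read of a Kafka topic (Kafka source supports batch queries).
// The read is bounded by earliest/latest offsets, so the job ends on its own.
val batchDF = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder
  .option("subscribe", "my-topic")                 // placeholder
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS DOUBLE) AS Field1")   // adjust to your actual message schema

// batchDF is a plain DataFrame, so approxQuantile is allowed here.
val Array(q1, q3) = batchDF.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
```

This only makes sense if the dataset is of bounded size, as the comments below suggest; for a genuinely unbounded stream, a single global quantile is not well defined.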

  • What would be the meaning of `approxQuantile` on a stream? It returns a single set of values. – Alper t. Turker Jun 11 '18 at 19:43
  • @user8371915: So I should not be using it on a structured stream? – user3243499 Jun 11 '18 at 19:50
  • I am streaming like 200000+ records, on which I am trying to find approxQuantile. – user3243499 Jun 11 '18 at 19:51
  • Does it mean data is of bounded size? Why use structured streaming in that case? – Alper t. Turker Jun 11 '18 at 19:59
  • If not then please suggest an alternative approach. I am new to spark. – user3243499 Jun 11 '18 at 20:12
  • Or, if it is possible to do the same in structured streaming some how. – user3243499 Jun 11 '18 at 20:13
  • Do you need a streaming dataframe? Wouldn't a normal one work? – Shaido Jun 12 '18 at 01:32
  • Have you solved this issue? I would like to use dfs.stat.approxQuantile on a Spark streaming dataset and I have the same issue – florins Oct 28 '19 at 10:54
  • You have to do operations on the streaming query itself. Otherwise, save all your records to disk first (maybe HDFS or NFS, or S3) and then calculate your desired score. It's pointless to calculate the score on a stream when it depends on records that were already received or are about to arrive, as in a stream you only get a micro-batch of records (maybe 3, 4, ...., 10). – user3243499 Oct 28 '19 at 12:39
