
I have two Spark streams. The first carries product data: the supplier's price, the currency, a description, and the supplier id. These records are enriched with a category, guessed by analyzing the description, and with the price converted to dollars. They are then saved to a parquet dataset.

The second stream carries data on the auctioning of these products: the price at which each was sold and the sale date.

Given that a product can arrive on the first stream today and be sold a year later, how can I join the second stream against the full history stored in the first stream's parquet dataset?

To be clear, the desired result is the average daily earnings per price range ...
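
For reference, the rough shape of the two streams (the column names here are placeholders, not the real ones):

from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, DateType, LongType)

# Stream 1: product records, enriched and saved to the parquet dataset
products_schema = StructType([
    StructField("product_id", LongType()),
    StructField("supplier_id", LongType()),
    StructField("description", StringType()),
    StructField("currency", StringType()),
    StructField("supplier_price", DoubleType()),
    StructField("price_usd", DoubleType()),   # added by the enrichment
    StructField("category", StringType()),    # guessed from the description
])

# Stream 2: sale records
sales_schema = StructType([
    StructField("product_id", LongType()),
    StructField("sale_price", DoubleType()),
    StructField("sale_date", DateType()),
])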

2 Answers


I found a possible solution with SnappyData, using its mutable DataFrames:

https://www.snappydata.io/blog/how-mutable-dataframes-improve-join-performance-spark-sql

The example reported there is very similar to the one described by Claudio D'Alicandro.
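
A minimal sketch of the idea, assuming SnappyData's Python SnappySession API; the table and column names are illustrative, not from the blog post:

from pyspark.sql.snappy import SnappySession

snappy = SnappySession(spark.sparkContext)

# A mutable row table that the first stream can update in place,
# instead of rewriting a parquet dataset every 30 minutes.
snappy.sql("""
    CREATE TABLE IF NOT EXISTS products (
        product_id LONG,
        price_usd DOUBLE,
        category STRING
    ) USING row OPTIONS (PARTITION_BY 'product_id')
""")

# The sales stream then always joins against the current table contents.
enriched = salesDF.join(snappy.table("products"), "product_id")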

– giorrrgio

If you are using Structured Streaming in Spark, you can load the parquet files written by the first stream into a DataFrame:

parquetFileDF = spark.read.parquet("products.parquet")

Then you can read the second stream and join it with the parquet-backed DataFrame:

streamingDF = spark.readStream. ...
# Stream-static joins support "inner" and "left_outer" when the static
# DataFrame is on the right; "right_join" is not a valid Spark join type.
streamingDF.join(parquetFileDF, "type", "left_outer")
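
Putting this together, a minimal runnable sketch, assuming a Kafka source for the sales stream (the topic, server, and all column names are illustrative), that also computes the average daily earnings per price range asked for in the question:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, LongType,
                               DoubleType, DateType)

spark = SparkSession.builder.appName("sales-enrichment").getOrCreate()

# Static snapshot of the product history written by the first stream
productsDF = spark.read.parquet("products.parquet")

sales_schema = StructType([
    StructField("product_id", LongType()),
    StructField("sale_price", DoubleType()),
    StructField("sale_date", DateType()),
])

salesDF = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sales")
    .load()
    .select(F.from_json(F.col("value").cast("string"), sales_schema).alias("s"))
    .select("s.*"))

# Stream-static join: each incoming sale is enriched with its product row
enriched = salesDF.join(productsDF, "product_id", "left_outer")

# Average daily earnings per price range (bucketed in $100 steps here)
result = (enriched
    .withColumn("price_range", F.floor(F.col("price_usd") / 100) * 100)
    .groupBy("sale_date", "price_range")
    .agg(F.avg("sale_price").alias("avg_daily_earnings")))

(result.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination())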

You can also join the first stream with the second stream directly.

Hope this helps.

– Gourav Dutta
  • Unfortunately I'm using the classic streams (DStreams); but even if I weren't, your solution would not pick up updates to the parquet dataset, which the first stream rewrites with a 30-minute window... If there were a way to re-read the DataFrame every 30 minutes, that would be great! Alternatively, it would be enough for me to be able to join two structured streams... – Claudio D'Alicandro Jan 18 '18 at 09:52
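
For the "classic streams" (DStream) case raised in this comment, one possible workaround, sketched here with illustrative names: re-read the parquet dataset inside transform(), which runs once per micro-batch and therefore picks up the 30-minute rewrites.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def join_with_latest_products(rdd):
    # Re-reading here means every micro-batch sees the latest rewrite
    products = spark.read.parquet("products.parquet")
    sales = spark.createDataFrame(rdd, ["product_id", "sale_price", "sale_date"])
    return sales.join(products, "product_id").rdd

# salesDStream is the DStream of (product_id, sale_price, sale_date) tuples
salesDStream.transform(join_with_latest_products).pprint()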