I am working on an application where I want to run Flink SQL on both real-time events and past events. I tried a POC in which Flink runs SQL on streaming sources such as Kafka, but the SQL query only returns new events or changes. I want to run SQL on the whole data set, some of which may change over time. In short, my requirement is to continuously query the whole data set. How can I achieve this with Flink or any other streaming solution?
-
Is there a semantic difference between the real-time data and the historical data? Are they the same thing, just one old and one new? Or are they two very different data sources, semantically speaking? – Chris Gerken Dec 22 '19 at 13:55
-
They come from the same streaming source, and there is no semantic difference between the real-time data and the historical data. – Dec 22 '19 at 13:59
-
Why not use Kafka for both? – Dominik Wosiński Dec 22 '19 at 18:16
-
Despite what marketing messages might suggest, Kafka is inherently a message bus and not a data store. It will be painful to use it as a historical data store as soon as the volume increases. – Dennis Jaheruddin Dec 26 '19 at 17:54
-
Agree with Dennis. Kafka as a persistent store for historical data doesn't seem like a good idea. Any thoughts on Pravega http://www.pravega.io/ ? – Dec 28 '19 at 20:03
2 Answers
0
Flink SQL doesn't yet offer a proper filesystem connector, so that makes this problematic, at least for now. Kafka, on the other hand, is well supported.
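
To illustrate what "continuously querying the whole data" means when Kafka holds both the history and the live events, here is a minimal sketch in plain Python. It does not use Flink or Kafka APIs. The event shape `(key, value)`, the function name `continuous_total`, and the sample data are all hypothetical, and the two lists stand in for a topic read from its earliest retained offset. The idea is that the query first replays the history, then keeps updating its result as new events arrive, with a later event for a key replacing the earlier one.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (key, value). A later event for the same key
# replaces the earlier one, which models "data changing over time".
Event = Tuple[str, int]


def continuous_total(events: Iterable[Event]) -> Iterator[int]:
    """Yield the sum over the latest value per key after every event."""
    latest: Dict[str, int] = {}
    total = 0
    for key, value in events:
        # Retract the previous value for this key (if any), then add the new one.
        total += value - latest.get(key, 0)
        latest[key] = value
        yield total


# Stand-in for a Kafka topic: retained history followed by live events.
history: List[Event] = [("a", 10), ("b", 5), ("a", 7)]
live: List[Event] = [("c", 1), ("b", 0)]

# Reading from the start means the query sees the history first and then
# keeps emitting updated results as live events arrive.
results = list(continuous_total(history + live))
print(results)
```

Each emitted value is the query result over everything seen so far, not just the newest event. This only works as long as the topic actually retains the history you need, which is the concern raised in the comments above.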

David Anderson
0
If you want a stateful backend that scales well as the history grows, it may be good to look at the available connectors.
The most likely candidate for this seems to be HBase.
That is the general answer.
It is probably best to start from there. Since you want to use S3, though, it may be good to know that the Cloudera Data Platform will soon include an S3-backed HBase solution.
Disclaimer: I am an employee of Cloudera, a driving force behind Kafka, HBase, and soon Flink.

Dennis Jaheruddin