
Currently we have Spark Structured Streaming.

In the Arrow docs I found Arrow streaming, where we can create a stream in Python, produce the data, and use a StreamReader to consume the stream in Java/Scala.
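
For reference, a minimal sketch of that flow with pyarrow (the consumer is shown here in Python against an in-memory buffer; in Java/Scala it would be an ArrowStreamReader over a real transport):

    import pyarrow as pa

    # Producer: write record batches into an Arrow IPC stream
    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, schema) as writer:
        batch = pa.record_batch(
            [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])], schema=schema
        )
        writer.write_batch(batch)

    # Consumer: iterate over the batches of the stream
    with pa.ipc.open_stream(sink.getvalue()) as reader:
        for batch in reader:
            print(batch.num_rows)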

I am wondering whether there is an integration of these two, where we could do something like producing the Arrow stream in Python and using Spark Structured Streaming to consume it (in a distributed manner)?


Imagine a scenario: one wants to build an easy-to-use Python API, but the computing engine runs on Java/Scala. Using Kafka/Redis would not preserve the data types across the languages, and with Arrow there is currently no cluster support for accessing the data.

Litchy

2 Answers


Perhaps not exactly what you're looking for, but Spark 3.3 will have a mapInArrow API call: https://github.com/apache/spark/pull/34505
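
A minimal sketch of how it is used (assuming Spark >= 3.3; the transformation and column names are just for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    def double(batches):
        # Each partition arrives as an iterator of pyarrow.RecordBatch
        for batch in batches:
            doubled = pc.multiply(batch.column("id"), 2)
            yield pa.RecordBatch.from_arrays([doubled], names=["id"])

    # Arrow batches flow between the JVM and the Python worker
    df.mapInArrow(double, schema="id long").show()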

This will not work with streaming though.

Tagar
    this is very close I think. Basically spark just need to create some wrapper of producing data and data receiver. – Litchy Nov 29 '21 at 07:18

I have never heard of a project like this. What you describe is pretty much PySpark Structured Streaming, where you have a running Python application on one side talking to the Spark infrastructure running on the JVM.
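
For comparison, a minimal PySpark Structured Streaming job (using the built-in rate source; the details are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The Python process only defines the query; the stream itself
    # runs on the JVM executors in a distributed manner
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    query = stream.writeStream.format("console").start()
    query.awaitTermination()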

Jacek Laskowski
  • Is a project like this promising, or possible to implement? Basically it is about integrating cross-language streaming and scalability together. – Litchy Nov 26 '19 at 01:21
  • _"Is a project like this promising"_? Dunno. _"Possible to implement"_? Yes. – Jacek Laskowski Nov 26 '19 at 09:22