
As far as I know, both platforms support big data ingestion (streaming).

What are the advantages and disadvantages of each platform?

sailfish009

1 Answer


Arrow Flight consists of a serialization format for Arrow over gRPC. It requires two applications: a client and a server. The server must be running for the client to send it messages.
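
For example, a minimal sketch in Python using pyarrow.flight (the port, the table contents, and the b"demo" command are placeholders of my own, not anything Flight prescribes):

    import pyarrow as pa
    import pyarrow.flight as flight

    class InMemoryFlightServer(flight.FlightServerBase):
        """Keeps uploaded tables in memory, keyed by the descriptor command."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)
            self._tables = {}

        def do_put(self, context, descriptor, reader, writer):
            # Read the entire Arrow stream the client uploads.
            self._tables[descriptor.command] = reader.read_all()

        def do_get(self, context, ticket):
            # Stream the requested table back as Arrow record batches.
            return flight.RecordBatchStream(self._tables[ticket.ticket])

    # Server process (must be running before the client connects):
    #     InMemoryFlightServer().serve()

    # Client process:
    client = flight.connect("grpc://localhost:8815")
    table = pa.table({"x": [1, 2, 3]})
    writer, _ = client.do_put(
        flight.FlightDescriptor.for_command(b"demo"), table.schema)
    writer.write_table(table)
    writer.close()
    print(client.do_get(flight.Ticket(b"demo")).read_all())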

Apache Kafka is a distributed, persistent, temporal log. It requires four components: Zookeeper, the Kafka broker, a producer application, and a consumer application. The producer and consumer are decoupled and need not be running at the same time. Zookeeper and the broker must always be available for a healthy system.
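
For comparison, a minimal sketch using the kafka-python client (the broker address and topic name are placeholders; the broker and Zookeeper are assumed to already be running):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer process: talks only to the broker, never to the consumer.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b'{"x": 1}')
    producer.flush()

    # Consumer process: may start later; the broker retains the log.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # replay from the start of the retained log
        consumer_timeout_ms=5000,      # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)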


With Flight, you have point-to-point client-server interactions between applications.

With Kafka, applications interact only with the broker middleware, not with one another.


In theory, one could write an Arrow serializer for Kafka; however, I would think row-oriented formats such as Thrift, Protobuf, or Avro make more sense over the network than popular analytic, columnar formats like Arrow, ORC, or Parquet.
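
For illustration, if you did want to put Arrow data on a Kafka topic, one sketch (using pyarrow's IPC stream format; the Kafka plumbing itself is omitted) would be to serialize each record batch to bytes and send those as the message value:

    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})

    # Producer side: serialize the batch to Arrow IPC stream bytes.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    payload = sink.getvalue().to_pybytes()  # value for producer.send(topic, payload)

    # Consumer side: deserialize the message value back into a table.
    restored = pa.ipc.open_stream(payload).read_all()

Note that the IPC stream embeds the schema in every payload here, which hints at why compact row-oriented encodings tend to be the usual choice for per-message Kafka traffic.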


Neither system is necessarily required for large data sets. In fact, I'm not sure Arrow scales any better than any other gRPC-based architecture.

The driving force behind Kafka is to reduce point-to-point application interaction.

OneCricketeer
  • I think Arrow doesn't require serialization/deserialization. – sailfish009 Jan 11 '20 at 01:08
  • And Kafka is hard to integrate into a solution without using a commercial offering like Confluent. – sailfish009 Jan 11 '20 at 01:36
  • @sailfish009 Arrow defines an in-memory data format... over a network, that needs to be serialized somehow... What exactly from Confluent are you talking about? KSQL, Schema Registry, Kafka Connect are all open source, not commercial offerings. – OneCricketeer Jan 11 '20 at 03:38
  • To me, Kafka is very hard to use anyway, so if I can use Apache Arrow Flight easily, I will choose it. – sailfish009 Jan 11 '20 at 03:54
  • What makes Kafka hard? You don't have to host it yourself, if that is what you mean. It's also not clear what type of application you are actually making. You could do the same with plain Protobuf, Thrift, Avro, etc. as with Arrow. – OneCricketeer Jan 11 '20 at 03:59
  • "get the data source over network and feed it to pytorch/tensorflow." – sailfish009 Jan 11 '20 at 04:06
  • You can use requests to pull over HTTP, a Kafka producer in Python to send it, and then a consumer to pull the data into a pytorch/tensorflow model... It might take some time, sure. – OneCricketeer Jan 11 '20 at 04:22
  • "I'm not sure Arrow scales any better than any other gRPC based architecture" @cricket_007 I think you may be discounting the impact of serialization on messaging throughput. It would be worth having a more nuanced discussion in a different medium, with code examples and benchmarks to support assertions. – Wes McKinney Jan 13 '20 at 17:05
  • @Wes I only stated that given my limited understanding of gRPC systems and Arrow Flight, not making a claim that it is "slow, bloated, doesn't scale *at all*", etc. Sure, the bytes are laid out differently, but as far as deployables go, there are still some clients and some servers which must be running together, as compared to using Kafka (or any other message queue). – OneCricketeer Jan 13 '20 at 21:00
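
As a rough illustration of the pipeline OneCricketeer describes in the comments above (the URL, topic name, and feed_to_model function are all hypothetical):

    import requests
    from kafka import KafkaProducer, KafkaConsumer

    # Fetch the source data over HTTP and publish it to a Kafka topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    response = requests.get("https://example.com/data.json")  # hypothetical endpoint
    producer.send("training-data", response.content)
    producer.flush()

    # In a separate process, consume the topic and feed a pytorch/tensorflow model.
    consumer = KafkaConsumer(
        "training-data",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
    )
    for message in consumer:
        feed_to_model(message.value)  # hypothetical model-ingestion hook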