
As far as I know, both platforms support big data ingestion (streaming).

What are the advantages and disadvantages of each platform?

sailfish009

1 Answer


Arrow Flight consists of a serialization format for Arrow over gRPC. It requires two applications: a client and a server. The server must be running for the client to send it messages.
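
For example, a minimal sketch in Python using pyarrow.flight (the port, the table contents, and the b"demo" command are placeholders of my own, not anything Flight prescribes):

    import pyarrow as pa
    import pyarrow.flight as flight

    class InMemoryFlightServer(flight.FlightServerBase):
        """Keeps uploaded tables in memory, keyed by the descriptor command."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)
            self._tables = {}

        def do_put(self, context, descriptor, reader, writer):
            # Read the entire Arrow stream the client uploads.
            self._tables[descriptor.command] = reader.read_all()

        def do_get(self, context, ticket):
            # Stream the requested table back as Arrow record batches.
            return flight.RecordBatchStream(self._tables[ticket.ticket])

    # Server process (must be running before the client connects):
    #     InMemoryFlightServer().serve()

    # Client process:
    client = flight.connect("grpc://localhost:8815")
    table = pa.table({"x": [1, 2, 3]})
    writer, _ = client.do_put(
        flight.FlightDescriptor.for_command(b"demo"), table.schema)
    writer.write_table(table)
    writer.close()
    print(client.do_get(flight.Ticket(b"demo")).read_all())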

Apache Kafka is a distributed, persistent, temporal log. It requires four components: Zookeeper, the Kafka broker, a producer application, and a consumer application. The producer and consumer are decoupled and need not be running at the same time. Zookeeper and the broker must always be available for a healthy system.
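
For comparison, a minimal sketch using the kafka-python client (the broker address and topic name are placeholders; the broker and Zookeeper are assumed to already be running):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer process: talks only to the broker, never to the consumer.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b'{"x": 1}')
    producer.flush()

    # Consumer process: may start later; the broker retains the log.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # replay from the start of the retained log
        consumer_timeout_ms=5000,      # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)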


With Flight, you have point-to-point client-server interactions between applications.

With Kafka, applications interact only with the broker middleware, not with one another.


In theory, one could write an Arrow serializer for Kafka; however, I would think row-oriented formats such as Thrift, Protobuf, or Avro make more sense over the network than popular analytic, columnar formats like Arrow, ORC, or Parquet.
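
For illustration, if you did want to put Arrow data on a Kafka topic, one sketch (using pyarrow's IPC stream format; the Kafka plumbing itself is omitted) would be to serialize each record batch to bytes and send those as the message value:

    import pyarrow as pa

    batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})

    # Producer side: serialize the batch to Arrow IPC stream bytes.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    payload = sink.getvalue().to_pybytes()  # value for producer.send(topic, payload)

    # Consumer side: deserialize the message value back into a table.
    restored = pa.ipc.open_stream(payload).read_all()

Note that the IPC stream embeds the schema in every payload here, which hints at why compact row-oriented encodings tend to be the usual choice for per-message Kafka traffic.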


Neither system is necessarily required for large data sets. In fact, I'm not sure Arrow scales any better than any other gRPC-based architecture.

The driving force behind Kafka is to reduce point-to-point application interaction.

OneCricketeer
  • I think Arrow doesn't require serialization/deserialization. – sailfish009 Jan 11 '20 at 01:08
  • And Kafka is hard to integrate into a solution without using a commercial offering like Confluent. – sailfish009 Jan 11 '20 at 01:36
  • @sailfish009 Arrow defines an in-memory data format... over a network, that needs to be serialized somehow... What exactly from Confluent are you talking about? KSQL, Schema Registry, Kafka Connect are all open source, not commercial offerings. – OneCricketeer Jan 11 '20 at 03:38
  • To me, Kafka is very hard to use anyway, so if I can use Apache Arrow Flight easily, I will choose it. – sailfish009 Jan 11 '20 at 03:54
  • What makes Kafka hard? You don't have to host it yourself, if that is what you mean. It's also not clear what type of application you are actually making. You could do the same with plain Protobuf, Thrift, Avro, etc. as with Arrow. – OneCricketeer Jan 11 '20 at 03:59
  • "get the data source over network and feed it to pytorch/tensorflow." – sailfish009 Jan 11 '20 at 04:06
  • You can use requests to pull over HTTP, a Kafka producer in Python to send it, and then a consumer to pull the data into a pytorch/tensorflow model... It might take some time, sure. – OneCricketeer Jan 11 '20 at 04:22
  • "I'm not sure Arrow scales any better than any other gRPC based architecture" @cricket_007 I think you may be discounting the impact of serialization on messaging throughput. It would be worth having a more nuanced discussion in a different medium, with code examples and benchmarks to support assertions. – Wes McKinney Jan 13 '20 at 17:05
  • @Wes I only stated that given my limited understanding of gRPC systems and Arrow Flight, not making a claim that it is "slow, bloated, doesn't scale *at all*", etc. Sure, the bytes are laid out differently, but as far as deployables go, there are still some clients and some servers which must be running together, as compared to using Kafka (or any other message queue). – OneCricketeer Jan 13 '20 at 21:00
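
As a rough illustration of the pipeline OneCricketeer describes in the comments above (the URL, topic name, and feed_to_model function are all hypothetical):

    import requests
    from kafka import KafkaProducer, KafkaConsumer

    # Fetch the source data over HTTP and publish it to a Kafka topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    response = requests.get("https://example.com/data.json")  # hypothetical endpoint
    producer.send("training-data", response.content)
    producer.flush()

    # In a separate process, consume the topic and feed a pytorch/tensorflow model.
    consumer = KafkaConsumer(
        "training-data",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
    )
    for message in consumer:
        feed_to_model(message.value)  # hypothetical model-ingestion hook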