What is the difference between Apache Drill's ValueVectors and Apache Arrow?

Question

Apache Drill has its own columnar representation, similar to Apache Arrow, but Apache Arrow supports more programming languages. I would like to use Apache Drill, but I also want the programming language support that Apache Arrow offers.

Some sources say that Apache Arrow has its roots in Apache Drill's ValueVectors.

Drill represents data internally as JSON documents – similar to MongoDB and Elasticsearch. These JSON documents are "shredded" into columns, which allows Drill to deliver the performance enhancements of columnar analytics but retain the ability to query complex data. Note, this internal representation is not based on Apache Arrow. - Source

Why can't Apache Drill make use of the Apache Arrow project? How does Drill's internal representation differ from Apache Arrow, and what advantages does Arrow have over Drill's ValueVectors, and vice versa?

Wes McKinney · Answer 1 · 2018-12-04T15:56:39.790

The Apache Arrow Java library started out as a fork of Drill's ValueVectors when the Apache Arrow project began at the beginning of 2016. The memory representation is nearly the same; one significant difference is that Arrow uses 1 bit to represent whether a vector slot is null, while Drill uses 1 byte. We decided to change this for reasons of memory efficiency and so that popcount intrinsic operations can be used to check whether a batch of values contains any nulls.
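To make the bit-versus-byte difference concrete, here is a minimal pure-Python sketch of the two null layouts. It is an illustration only, not code from Drill or Arrow: the function names are hypothetical, and it assumes least-significant-bit-first numbering within each byte for the bitmap.

```python
def pack_validity_bits(valid):
    """Arrow-style layout: 1 bit per slot, least-significant bit first."""
    bitmap = bytearray((len(valid) + 7) // 8)
    for i, is_valid in enumerate(valid):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def pack_validity_bytes(valid):
    """Drill-style layout as described above: 1 byte per slot."""
    return bytes(1 if is_valid else 0 for is_valid in valid)


def null_count_from_bitmap(bitmap, length):
    """Count nulls by counting set bits (a software popcount).

    Padding bits past `length` are assumed to be zero, which holds for
    bitmaps produced by pack_validity_bits.
    """
    set_bits = bin(int.from_bytes(bitmap, "little")).count("1")
    return length - set_bits


# Ten slots, three of which are null (False).
valid = [True, False, True, True, False, True, True, True, True, False]

bit_layout = pack_validity_bits(valid)    # 2 bytes for 10 slots
byte_layout = pack_validity_bytes(valid)  # 10 bytes for 10 slots
nulls = null_count_from_bitmap(bit_layout, len(valid))

print(len(bit_layout), len(byte_layout), nulls)
```

The bitmap needs one eighth of the space of the byte-per-slot vector (rounded up to a whole byte), and a batch contains no nulls exactly when the number of set bits equals the number of slots. Native implementations would use a hardware popcount instruction rather than the string-based count used here.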

It has been discussed whether to use exactly Arrow's representation in Apache Drill, but there is no timeline for this to happen. The relevant issue is https://issues.apache.org/jira/browse/DRILL-4455

Apache Arrow has been developed as an open standard with a public API in many programming languages. We now have some level of support for 11 programming languages, either through native implementations or bindings. These include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

I am not aware of any performance analysis comparing the memory representations, but the difference relating to null representation is unlikely to cause a significant difference.

score 1 · Answer 2 · answered Dec 25 '18 at 21:46

Drill's community is considering moving to Apache Arrow. Please take a look at the following tickets: https://issues.apache.org/jira/browse/ARROW-3164
https://issues.apache.org/jira/browse/DRILL-4455

However, that work is on hold right now because both projects have seen many changes and improvements. As a result, there are some differences in terminology, metadata notation, data types, and data layout.
You can reply to this thread on the Drill dev mailing list to discuss it further: https://lists.apache.org/thread.html/8d895fb40702f3120532f15594ea935a818ac0eb5acdb4fd1248d89f@%3Cdev.drill.apache.org%3E
Contributions are also very welcome :)
