How does Dremel or its implementation (say Drill) handle large columnar data layout in memory?

Question

I am reading the Google Dremel white paper and learned that it converts complex (nested) data into a columnar data layout.

At what location is this data stored?

As Drill has no central metadata repository, I assume the columnar data must be held in memory.

Therefore how does Drill handle this data when I have billions of rows?

catpaws · Accepted Answer · 2015-08-28T20:23:09.630

2

To get complete, consistent query results from billions of rows, use a distributed file system connected to multiple Drillbits, simulate a distributed file system by copying the files to each node, or use an NFS volume such as Amazon Elastic File System. Drill achieves fast queries over big data through a number of techniques, including these:

Relies on the cluster nodes to handle failures (doesn't spend time on failure-related tasks).
Uses an in-memory data model that's hierarchical and columnar (doesn't read from disk the columns that an analytic query does not involve, and processes the columnar data without materializing rows).
Uses columnar storage optimizations and execution (keeps memory footprint low).
Uses vectorization to work on arrays of values from different records rather than single values from one record at a time.
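
To make the columnar and vectorized points above concrete, here is a minimal sketch in Python. It is not Drill's actual implementation; the record fields, `to_columns`, `sum_amount_where_country`, and `batch_size` are all hypothetical names chosen for illustration. It only shows the general idea: pivot records into one array per column, then answer a query by scanning batches of values from just the columns the query needs.

```python
from array import array

# Hypothetical flat records in row-oriented form (one dict per record).
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 4.5},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous sequence per column."""
    return {
        "id": array("q", (r["id"] for r in records)),
        "country": [r["country"] for r in records],
        "amount": array("d", (r["amount"] for r in records)),
    }


def sum_amount_where_country(columns, country, batch_size=2):
    """Sum 'amount' for rows matching 'country'.

    Only the 'country' and 'amount' columns are read; 'id' is never touched.
    Values are processed a batch at a time rather than one record at a time,
    and no per-row record is rebuilt along the way.
    """
    countries = columns["country"]
    amounts = columns["amount"]
    total = 0.0
    for start in range(0, len(amounts), batch_size):
        c_batch = countries[start:start + batch_size]
        a_batch = amounts[start:start + batch_size]
        total += sum(a for c, a in zip(c_batch, a_batch) if c == country)
    return total


columns = to_columns(rows)
print(sum_amount_where_country(columns, "US"))  # prints 14.5
```

In a real engine the batches would be read from columnar files spread across nodes, so only a bounded slice of each needed column has to be in memory at any moment.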

For more information, see http://drill.apache.org/docs/performance/.

edited Aug 28 '15 at 20:23

answered Aug 28 '15 at 17:56


What will Drill do when its internal memory is full? After processing millions of rows, its memory may fill up. – Dev Aug 31 '15 at 03:04
I think Drill uses disk to store intermediate results when memory is full. I've heard reports from happy users who were able to run huge queries that took hours to complete but would not run on other software. A community member offered terabyte testing facilities at the Drill Tuesday user group hangout (see http://apache.github.io/drill/community-resources/ for link), open to all, so I hope we'll have performance data soon. – catpaws Sep 01 '15 at 14:00
http://apache.github.io/drill/docs/drill-query-execution/#execution-of-minor-fragments and other sections on the page might be helpful to explain the complex query execution process. – catpaws Sep 01 '15 at 14:11
