I want to test CREATE TABLE with PARTITION BY HASH in Kudu.

This is my CREATE TABLE statement.

CREATE TABLE customers (
  state STRING,
  name STRING,
  purchase_count int,
  PRIMARY KEY (state, name)
)
PARTITION BY HASH (state) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES (
  'kudu.master_addresses' = '127.0.0.1',
  'kudu.num_tablet_replicas' = '1'
);

Some inserts...

insert into customers values ('madrid', 'pili', 8);
insert into customers values ('barcelona', 'silvia', 8);
insert into customers values ('galicia', 'susi', 8);

Avoiding issues...

COMPUTE STATS customers;
Query: COMPUTE STATS customers
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+

And then...

show partitions customers;
Query: show partitions customers
+--------+-----------+----------+----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
+--------+-----------+----------+----------------+------------+
| -1     |           | 00000001 | hidra:7050     | 1          |
| -1     | 00000001  |          | hidra:7050     | 1          |
+--------+-----------+----------+----------------+------------+
Fetched 2 row(s) in 2.31s

Where are my rows? What does the "-1" mean?

Is there any way to see whether row distribution is working properly?

icalvete

1 Answer

Based on further research, including the white paper at https://kudu.apache.org/kudu.pdf: the COMPUTE STATS statement works with partitioned tables that use HDFS, not with Kudu tables. Although Kudu does not use HDFS files internally, Impala's modular architecture allows a single query to transparently join data from multiple different storage components. For example, a text log file on HDFS can be joined against a large dimension table stored in Kudu.
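As a rough sketch of such a cross-storage join, the query below joins the Kudu table from the question against an HDFS-backed text table. The table web_logs and its columns (state, customer_name, url) are hypothetical, invented only for illustration; adjust them to whatever log table you actually have.

```sql
-- Hypothetical HDFS-backed log table (not part of the original question).
CREATE TABLE web_logs (
  state STRING,
  customer_name STRING,
  url STRING
)
STORED AS TEXTFILE;

-- Join the HDFS text table against the Kudu table `customers`.
-- Impala plans both scans in a single query, whatever the storage layer.
SELECT c.state, c.name, c.purchase_count, COUNT(*) AS hits
FROM web_logs l
JOIN customers c
  ON l.state = c.state
 AND l.customer_name = c.name
GROUP BY c.state, c.name, c.purchase_count;
```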

For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type of optimization is especially effective for partitioned Kudu tables, where the Impala query WHERE clause refers to one or more primary key columns that are also used as partition key columns.
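A minimal way to observe this on the customers table from the question is to filter on state, which is both a primary key column and the hash partition column, and to inspect the plan with EXPLAIN. The exact plan text varies by Impala version, so treat the comments below as what I would expect to see rather than guaranteed output.

```sql
-- The predicate is on `state`, a primary key column that is also the
-- hash partition column, so Impala can hand the filter to Kudu.
SELECT name, purchase_count
FROM customers
WHERE state = 'madrid';

-- Inspect the plan. In the Kudu scan node, look for the predicate being
-- listed as a Kudu predicate rather than as an Impala-side filter
-- (assumption: the wording of this line differs between versions).
EXPLAIN
SELECT name, purchase_count
FROM customers
WHERE state = 'madrid';
```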