1

I have around 34 million rows, each with 23 columns in the store_sales table in the tpc-ds dataset .

I have a composite primary key of columns ss_item_sk and ss_ticket_number.

Upon running the query SELECT count(DISTINCT <primary key>) .. I'm able to see that it outputs the total number of rows present in the table.

Now I add another column along with the primary key, namely ss_sold_date_sk.

After this, if I run the countquery, I get fewer number of rows being printed than before. Can someone explain to me, via example, why this could happen?

TL;DR

When does adding a column to a composite primary key ever stop making it unique?

vladimir
  • 13,428
  • 2
  • 44
  • 70
Zzrot
  • 304
  • 2
  • 4
  • 20

1 Answers1

2

Demo

create table mytable (c1 string,c2 string);
insert into mytable values ('A','A'),('B',null),('C',null);

select count(distinct c1) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+

As Expected - 3 distinct values - 'A','B' and 'C'


select count(distinct concat(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+
    

As expected. Why? - see next query


select c1,c2,concat(c1,c2) as concat_c1_c2 from mytable;

+----+------+--------------+
| c1 |  c2  | concat_c1_c2 |
+----+------+--------------+
| A  | A    | AA           |
| B  | NULL | NULL         |
| C  | NULL | NULL         |
+----+------+--------------+

Concatenation with NULL yields NULL


select count(distinct c1,c2) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              1 |
+----------------+

BUG!!


Here is a work-around the bug:

select count(distinct struct(c1,c2)) as count_distinct from mytable;

+----------------+
| count_distinct |
+----------------+
|              3 |
+----------------+
Community
  • 1
  • 1
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88
  • IMHO it's working as documented: *COUNT(DISTINCT expr,[expr...]) Returns a count of the number of rows with different non-NULL expr values. * – dnoeth Feb 23 '17 at 23:11
  • 1
    @dnoeth - First of all I can change the documentation at this moment. :-) Second, the UDF documentation (cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF‌​) states "count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.". It doesn't state, nor your quote, that **all** the expressions in a row should be non-NULL. – David דודו Markovitz Feb 24 '17 at 04:38
  • And most important - if `select distinct ...` treats the columns' combination as different so should `select count ( distinct ...)` - therefore bug – David דודו Markovitz Feb 24 '17 at 06:40
  • Standard SQL doesn't define `COUNT(DISTINCT )` on multiple columns, so a vendor can probably implement whatever suits him best :-) – dnoeth Feb 24 '17 at 09:56