Exact behavior of ADD COLUMN with CASCADE in Hive

Question

Suppose the next table:

sch_test.test_cascade
    a    b    c
    -----------
    6    10   1
    6    10   1
    6    10   2
    6    10   2

a, b and c are integers and the table is partitioned by c. When I add a column in Hive with CASCADE:

ALTER TABLE sch_test.test_cascade ADD COLUMNS (d int, e int) CASCADE;

it returns the next table:

sch_test.test_cascade
a    b    d    e     c
-------------------------
6    10   1    NULL  1    
6    10   1    NULL  1  
6    10   2    NULL  2  
6    10   2    NULL  2

That is, the first inserted column (d, in this case) takes the values of the partition column (Note that I've added two columns to show that the second one is NULL). But, if I add a column without CASCADE:

ALTER TABLE sch_test.test_cascade ADD COLUMNS (d int, e int);

it returns the next table:

sch_test.test_cascade
a    b    d       e      c
----------------------------
6    10   NULL    NULL   1
6    10   NULL    NULL   1
6    10   NULL    NULL   2
6    10   NULL    NULL   2

That is, both inserted columns are NULL.

I don't understand very well the difference between add a column with CASCADE or RESTRICT, in Hive documentation I can see:

The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE ADD|REPLACE COLUMNS with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column changes only to table metadata.

But it isn't clear to me the difference between "cascades the same change to all the partition metadata" and "limiting column changes only to table metadata". (Does Restrict apply only to one partition and Cascade to all them? The previous example says the opposite). And if that is the reason why the first column inserted (in CASCADE mode) takes the value of the partition column.

Without `CASCADE` the `ALTER` applies only to the _default_ definition i.e. the one that will be used for new partitions, but not to existing partitions. The rationale is that you already have legacy data files with legacy format in legacy partitions. — Samson Scharfrichter, Dec 13 '17 at 19:02
About your test case with the partition key being used also "magically" as the next actual column, well, ahem, I can only guess. Are you sure there wasn't an extra column in the data files, that was hidden because your Hive definition mapped only the first 2 cols?? Can you _tail_ one of these files to be sure? — Samson Scharfrichter, Dec 13 '17 at 19:07
Thank you for you answer. I'm pretty sure because I've written the files by hand, so I think the assumption that there are extra columns doesn't apply, it's still a mystery! Nevertheless your explanation about `CASCADE` and `ALTER` has sense, because only `CASCADE` changes de value of old partitions (but in a strange way). — Amanda, Dec 14 '17 at 09:10
if you don't add cascade then you can't load(historical or backfill) those columns in the already existing partitions. — Sivaji, Nov 02 '18 at 00:22

score -1 · Answer 1 · answered Nov 18 '20 at 07:00

-1

If we don't add cascade during ADD column, the existing partitions would have same definition. This new column would be null in them, even after overwriting the partition with new field. You can drop and recreate the partition to fix this though.

answered Nov 18 '20 at 07:00

Pramod Reddy Baireddy

1

Exact behavior of ADD COLUMN with CASCADE in Hive

1 Answers1