
I have a folder which previously had subfolders based on ingestiontime, which is also the original PARTITION column used in its Hive table.

So the folder looks like -

s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........

Inside each ingestiontime folder, the data is stored in Parquet format.

Now, in the same myStreamingData folder, I am adding data that is similar, but placed under an additional folder level named businessname.

So my folder structure now looks like -

s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....
........

So I need to add the data under the businessname partitions to my current Hive table too.

To achieve this, I was running the ALTER query below (on Databricks) -

%sql
alter table gp_hive_table add partition (businessname=007,ingestiontime=20200712230000) location "s3://MyDevBucket/dev/myStreamingData/businessname=007/ingestiontime=20200712230000"

But I am getting this error -

Error in SQL statement: AnalysisException: businessname is not a valid partition column in table `default`.`gp_hive_table`.;

What am I doing incorrectly here?

Thanks in Advance.

– Golokesh Patra

3 Answers


Since you're already using Databricks and this is a streaming use case, you should definitely take a serious look at using Delta Lake tables.

You won't have to mess with explicit ... ADD PARTITION and MSCK statements. Delta Lake, with its ACID properties, will ensure your data is committed properly; if your job fails, you won't end up with partial results. As soon as the data is committed, it is available to users (again, without the MSCK and ADD PARTITION statements).

Just change 'USING PARQUET' to 'USING DELTA' in your DDL.
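For example, a minimal DDL sketch (the table name, data column and path here are placeholders I'm assuming, not your actual schema):

    CREATE TABLE gp_delta_table (
      somecol STRING,                               -- placeholder data column
      businessname STRING,
      ingestiontime BIGINT
    )
    USING DELTA                                     -- instead of USING PARQUET
    PARTITIONED BY (businessname, ingestiontime)    -- optional for Delta tables
    LOCATION 's3://MyDevBucket/dev/myStreamingDataDelta/';   -- hypothetical new path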

You can also CONVERT your existing Parquet table to a Delta Lake table and then start using INSERT, UPDATE, DELETE, MERGE INTO, COPY INTO from Spark batch and Structured Streaming jobs. OPTIMIZE will clean up the small-file problem.
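Roughly, those commands look like this on Databricks (a sketch only, re-using the table name from your question; exact behaviour depends on how the existing table was defined):

    CONVERT TO DELTA gp_hive_table;   -- converts the existing Parquet table in place
    OPTIMIZE gp_hive_table;           -- compacts the resulting small files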

Douglas M
  • 1,035
  • 8
  • 17
  • 1
  • Thanks @Douglas, I am also using Databricks Delta Lake tables. But here my intention is to process and support old historical data and apps (3-4 years old), which were not ingested according to the rather new Databricks Delta layout. But your comment has given me a better idea, will try and let you know. – Golokesh Patra Aug 03 '20 at 07:14
  • @GolokeshPatra Ok, understood. I misread the earlier question and your answer. What's the purpose of adding businessname? Is it to facilitate query performance or data protection? I have found that partitions in the metastore may cause your read performance to suffer. Delta Lake tables don't need them (in the metastore). – Douglas M Aug 03 '20 at 14:04

ALTER TABLE gp_hive_table ADD PARTITION adds a partition (a data location, not a new column) to a table with an already-defined partitioning scheme. It does not change the current partitioning scheme; it only adds partition metadata saying that, at some location, there is a partition corresponding to some partitioning column value.
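For example, with the table's current scheme (ingestiontime only), a valid statement looks like this (the ingestiontime value here is just a hypothetical new one):

    ALTER TABLE gp_hive_table ADD PARTITION (ingestiontime=20200713230000)
    LOCATION 's3://MyDevBucket/dev/myStreamingData/ingestiontime=20200713230000';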

If you want to change the partition columns, you need to recreate the table:

  1. Drop the table (check that it is EXTERNAL first, so the data is not deleted): DROP TABLE gp_hive_table;

  2. Create the table with the new partitioning column. Partitions WILL NOT be created automatically.

  3. Now you can add partitions using ALTER TABLE ADD PARTITION, or use MSCK REPAIR TABLE to create them automatically based on the directory structure. The directory structure should already match the partitioning scheme before you execute these commands (see the sketch after this list).
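A rough sketch of that flow, re-using the table name and bucket from the question (the data column and the partition column types are assumptions):

    -- 1. drop the external table (metadata only, the data stays in S3)
    DROP TABLE gp_hive_table;

    -- 2. recreate it with businessname as an additional partition column
    CREATE EXTERNAL TABLE gp_hive_table (
      somecol STRING                                  -- placeholder for your actual data columns
    )
    PARTITIONED BY (businessname STRING, ingestiontime BIGINT)
    STORED AS PARQUET
    LOCATION 's3://MyDevBucket/dev/myStreamingData/';

    -- 3. discover partitions whose directories match the new businessname=/ingestiontime= layout
    MSCK REPAIR TABLE gp_hive_table;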

– leftjoin
  • Thanks @leftjoin, but if you look carefully, my folder/directory structure is not uniform. Amongst all the ingestiontime=... folders, I have multiple businessname/ingestiontime=... folders as well. So I want to add new partitions to the already created Hive table. I am not trying to add a new column; I am trying to add new partitions to the already present Hive external table, i.e. to add the new businessname/ingestiontime=... folders to its metadata. – Golokesh Patra Jul 14 '20 at 04:39
  • @GolokeshPatra if you are not trying to add a new column, then why are you using TWO columns in the partition spec: `partition (businessname=007,ingestiontime=20200712230000)`? Use the same partitioning as defined in the table DDL and add ANY directory as a partition – leftjoin Jul 14 '20 at 07:35

So, building upon the suggestion from @leftjoin,

Instead of having a Hive table without businessname as one of the partition columns, what I did is -

Step 1 -> Create the Hive table with - PARTITIONED BY (businessname long, ingestiontime long)

Step 2 -> Executed the query - MSCK REPAIR TABLE <Hive_Table_name> - to auto-add partitions.

Step 3 -> Now, there are ingestiontime folders which are not under any businessname folder, i.e. folders like -

s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200712230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200711230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200710230000/....
s3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/....

I wrote a small piece of code to fetch all such partitions and then ran the following query for each of them -

ALTER TABLE <hive_table_name> ADD PARTITION (businessname=<some_value>, ingestiontime=<ingestion_time_partition_name>) LOCATION "<s3_location_of_all_partitions_not_belonging_to_a_specific_businesskey>"
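For illustration, one such filled-in statement might look like this (the table name and the default businessname=000 are hypothetical placeholders; only the folder path comes from the layout above):

    ALTER TABLE my_new_hive_table ADD IF NOT EXISTS
    PARTITION (businessname=000, ingestiontime=20200709230000)   -- hypothetical default businessname for legacy folders
    LOCATION 's3://MyDevBucket/dev/myStreamingData/ingestiontime=20200709230000/';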

This solved my issue.

– Golokesh Patra