DATABRICKS SQL - can't read data from partitioned parquet file

Question

I'm trying to read parquet files structured as:

filename/year=2020/month=12/day=1

files are under the following Mounted AzureStorage as following logic: /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1

I'm trying to create a table, using this sintax:

CREATE  TABLE tablename
(
  FIELD1 string,
  ...
  ,FIELDn Date
  ,Year INT
  ,Month INT
  ,Day INT
)
USING org.apache.spark.sql.parquet 
LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
OPTIONS(  'compression'='snappy')
PARTITIONED BY (Year, Month, Day)

But all options I tried for LOCATION gets no Results. I already tried: /mnt/silver/folder/folder/parquet/* and also many variations of it.

Any suggestion please?

score 2 · Answer 1 · answered Feb 24 '21 at 17:38

You need to execute MSCK REPAIR TABLE <table_name> or ALTER TABLE <table_name> RECOVER PARTITIONS - any of them forces to re-discover data in partitions.

From documentation:

When creating a table using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore

P.S. when you use Delta, that's done automatically, so that's one of the good reasons for using it :-)

score 0 · Answer 2 · answered Jul 28 '23 at 10:51

Maybe a lot has changed since you have posted your question.

I am using partitioned snappy compressed parquet files and managed to create a table - regarding your given example /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1 - in Azure Databricks like this

Option 1: CREATE TABLE USING

CREATE TABLE tablename
(
   field1 STRING
  ,...
  ,fieldn DATE
  ,year INT
  ,month INT
  ,day INT
)
USING PARQUET
PARTITIONED BY (year, month, day)
LOCATION '/mnt/silver/root_folder/folder_A/parquet/';

Option 2: CREATE TABLE with Hive format

CREATE TABLE tablename
(
   field1 STRING
  ,...
  ,fieldn DATE
  ,year INT
  ,month INT
  ,day INT
)
PARTITIONED BY (year, month, day)
STORED AS PARQUET
LOCATION '/mnt/silver/root_folder/folder_A/parquet/';

However, I'm not sure, if month=01 would be interpreted as INT.

DATABRICKS SQL - can't read data from partitioned parquet file

2 Answers2