
I'm trying to persist a DataFrame into a partitioned Hive table and am getting this exception. I have looked into it many times but am not able to find the fault.

org.apache.spark.sql.AnalysisException: Specified partition columns (timestamp_val) do not match the partition columns of the table. Please use () as the partition columns.;

Here is the script with which the external table was created:

CREATE EXTERNAL TABLE IF NOT EXISTS events2 (
action string
,device_os_ver string
,device_type string
,event_name string
,item_name string
,lat DOUBLE
,lon DOUBLE
,memberid BIGINT
,productupccd BIGINT
,tenantid BIGINT
) PARTITIONED BY (timestamp_val DATE)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'maprfs:///location/of/events2'
TBLPROPERTIES ('serialization.null.format' = '');

Here is the result of describe formatted for the table "events2":

hive> describe  formatted events2;
OK
# col_name              data_type               comment             

action                  string                                      
device_os_ver           string                                      
device_type             string                                      
event_name              string                                      
item_name               string                                      
lat                     double                                      
lon                     double                                      
memberid                bigint                                      
productupccd            bigint                                      
tenantid                bigint                                      

# Partition Information      
# col_name              data_type               comment             

timestamp_val           date                                        

# Detailed Table Information         
Database:               default                                  
CreateTime:             Wed Jan 11 16:58:55 IST 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               maprfs:/location/of/events2  
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    EXTERNAL                TRUE                
    serialization.null.format                       
    transient_lastDdlTime   1484134135          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.078 seconds, Fetched: 42 row(s)

Here is the code that partitions the data and stores it into the table:

val tablepath = Map("path" -> "maprfs:///location/of/events2")

AppendDF.write
  .format("parquet")
  .partitionBy("Timestamp_val")
  .options(tablepath)
  .mode(org.apache.spark.sql.SaveMode.Append)
  .saveAsTable("events2")

While running the application, I'm getting the following:

Specified partition columns (timestamp_val) do not match the partition columns of the table. Please use () as the partition columns.

I might be making an obvious mistake; any help is much appreciated :)

  • I was getting the same error while writing a DataFrame to a Delta table. It turned out I had earlier created the table without any partition column, and later tried to write to the same table using a partition column, hence the error. What worked for me was creating a new table with partitionBy. – Kishore Kumar Singh Jun 01 '20 at 06:11

1 Answer


Please print the schema of the DataFrame:

AppendDF.printSchema()

Make sure there is no type mismatch between the DataFrame columns and the table definition.
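
For example, here is a minimal sketch (reusing AppendDF and the table path from the question; whether saveAsTable can append to a table created outside Spark depends on your Spark/Hive versions) that casts the partition column to DATE so it matches the Hive DDL, and uses the exact lowercase column name in partitionBy:

import org.apache.spark.sql.functions.col

// Path from the question; saveAsTable with an explicit path keeps the
// table external at its existing location.
val tablepath = Map("path" -> "maprfs:///location/of/events2")

// Cast timestamp_val (a timestamp in the DataFrame, per printSchema) to
// DATE, the type declared in the Hive DDL, before partitioning by it.
val fixedDF = AppendDF.withColumn("timestamp_val", col("timestamp_val").cast("date"))

fixedDF.write
  .format("parquet")
  .partitionBy("timestamp_val")
  .options(tablepath)
  .mode(org.apache.spark.sql.SaveMode.Append)
  .saveAsTable("events2")

The cast matters because describe formatted shows timestamp_val as DATE, while printSchema (see the comment below) reports it as timestamp.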

  • root
    |-- action: string (nullable = true)
    |-- device_os_ver: string (nullable = true)
    |-- device_type: string (nullable = true)
    |-- event_name: string (nullable = true)
    |-- item_name: string (nullable = true)
    |-- lat: double (nullable = true)
    |-- lon: double (nullable = true)
    |-- memberid: long (nullable = true)
    |-- productUpccd: long (nullable = true)
    |-- tenantid: long (nullable = true)
    |-- timestamp_val: timestamp (nullable = false)
    – jack AKA karthik Jan 12 '17 at 06:18
  • You would need to declare the Hive table column as TIMESTAMP. – KiranM Jan 12 '17 at 13:42
  • Yes Kiran, I have made the changes. As per my Hive version, Parquet cannot take a timestamp as a column, so I have converted the timestamp into a string in my DataFrame like this: http://stackoverflow.com/questions/41607192/how-to-convert-current-timestamp-value-in-to-a-string-in-scala – jack AKA karthik Jan 12 '17 at 14:21
  • Now I'm facing a version issue. – jack AKA karthik Jan 12 '17 at 15:08
  • It is not clear what the version issue is here. As per the documentation, Hive supports TIMESTAMP from version 0.8.0. You would lose data if you truncate the timestamp to DATE (YYYY-MM-DD), and if you convert the Hive type to STRING, you would get slower (and possibly wrong) queries on the date-string column. – KiranM Jan 12 '17 at 15:45
  • I'm using the Parquet format for the Hive table, which does not accept DATE or TIMESTAMP as-is. – jack AKA karthik Jan 13 '17 at 04:34
  • Moreover, .saveAsTable() is not supported in my Hive version. – jack AKA karthik Jan 13 '17 at 04:54