
I have an external table pointing to an S3 location (Parquet files) in which every column is typed as string. I want to give the columns their correct datatypes instead of reading everything as strings. When I drop the external table and recreate it with the new datatypes, every SELECT query fails with an error like the one below:

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  • Please share your 'create table' syntax and the query which leads to the above error – Satya Pavan Mar 31 '18 at 22:24
  • @querymind I think when you use external table you can't change the datatype of your column to something else if this one is originaly String – HISI Apr 01 '18 at 14:37
  • @SatyaPavan My original table is something like this: `CREATE EXTERNAL TABLE test_database.test_table (id string, place string, total string) COMMENT 'This is a test table.' PARTITIONED BY (date string) STORED AS PARQUET LOCATION 's3 location'`. I dropped this table and created a new one like: `CREATE EXTERNAL TABLE test_database.test_table (id bigint, place string, total bigint) COMMENT 'This is a test table.' PARTITIONED BY (date date) STORED AS PARQUET LOCATION 's3 location'` – querymind Apr 01 '18 at 14:41
  • @querymind Try keep datatype of original external table, and then try to change it to the datatype you need by: `ALTER TABLE table CHANGE col col type;` I think that should work – HISI Apr 01 '18 at 14:45
  • @hisi That doesn't work; I get an error: `org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'id' with type 'StringType' to 'id' with type 'LongType';`. I am not sure if the ALTER TABLE command works differently for partitioned tables – querymind Apr 01 '18 at 16:39
  • @querymind you maybe need to create a managed table from your externel table: `create table tablename as select * from yourtable` , and then you can run alter, try this one – HISI Apr 01 '18 at 16:53
  • Have you checked this link - [https://issues.apache.org/jira/browse/SPARK-17557] – Satya Pavan Apr 02 '18 at 03:16
  • I am facing this problem. Has anyone figured out a solution? – Gladiator Mar 16 '21 at 14:38
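Since the Parquet files on S3 physically store every column as a string, simply redeclaring the external table with `bigint` columns cannot work: the reader still finds binary (string) data in the files, hence the `decodeToInt` failure. A common workaround, sketched below with the table and column names from the comments (the S3 path and the fixed-table name are placeholders), is to rewrite the data with the desired types into a new location and point a correctly typed table there:

```sql
-- Sketch only: names and paths are placeholders, not from the original post.
-- Needed for the dynamic-partition INSERT below.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- New external table with the intended datatypes, at a NEW location.
CREATE EXTERNAL TABLE test_database.test_table_fixed (
  id BIGINT,
  place STRING,
  total BIGINT
)
PARTITIONED BY (`date` DATE)
STORED AS PARQUET
LOCATION 's3://bucket/new-location/';

-- Rewrite the data, casting each string column to its correct type.
-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE test_database.test_table_fixed PARTITION (`date`)
SELECT CAST(id AS BIGINT),
       place,
       CAST(total AS BIGINT),
       CAST(`date` AS DATE)
FROM test_database.test_table;
```

Once the rewritten files exist, the original table can be dropped and recreated (or renamed) against the new location; because the files now genuinely contain `bigint` values, the typed schema reads them without error.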

1 Answer


Specify the type as BIGINT, which is equivalent to a long; Hive does not have a LONG datatype.

hive> ALTER TABLE table CHANGE col col BIGINT;

Duplicate content, from Hortonworks forum
