
The following is my sample CSV file:

id,name,gender
1,isuru,male
2,perera,male
3,kasun,male
4,ann,female

I converted the above CSV file into Apache Parquet using the pandas library. The following is my code:

import pandas as pd

df = pd.read_csv('./data/students.csv')
df.to_parquet('students.parquet')
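
One way to see exactly what schema pandas wrote is to read it back with pyarrow (pandas' default Parquet engine); a minimal sketch, assuming pyarrow is installed and the file path from above:

import pyarrow.parquet as pq

# Inspect the schema that pandas wrote to the Parquet file
print(pq.read_schema('students.parquet'))

# By default the id column shows up as: id: int64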

After that I uploaded the Parquet file to S3 and created an external table like below:

create external table imp.s1 (
id integer,
name varchar(255),
gender varchar(255)
)
stored as PARQUET 
location 's3://sample/students/';
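
For reference, the upload step can also be scripted; a minimal sketch with boto3, assuming the bucket and prefix from the table definition above and AWS credentials already configured in the environment:

import boto3

s3 = boto3.client('s3')

# Upload the local Parquet file under the prefix the external table points at
s3.upload_file('students.parquet', 'sample', 'students/students.parquet')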

After that I just ran the select query below, but I got the following error:

select * from imp.s1

Spectrum Scan Error. File 'https://s3.ap-southeast-2.amazonaws.com/sample/students/students.parquet'
has an incompatible Parquet schema for column 's3://sample/students.id'.
Column type: INT, Parquet schema:
optional int64 id [i:0 d:1 r:0]
(s3://sample/students.parquet)

Could you please help me figure out what the problem is here?


1 Answer


For integer columns, pandas defaults to the dtype int64 (and to the nullable Int64 extension dtype when NULLs are present), which corresponds to Bigint in Parquet on Amazon S3, while your external table declares id as INTEGER:

Parquet   Amazon S3 file data type   Transformation description
Int32     Integer                    -2,147,483,648 to 2,147,483,647 (precision of 10, scale of 0)
Int64     Bigint                     -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (precision of 19, scale of 0)
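
You can see this default inference directly on the sample data; a quick check, assuming the same CSV file as in the question:

import pandas as pd

df = pd.read_csv('./data/students.csv')
print(df.dtypes)

# id         int64   <- written to Parquet as INT64, i.e. Bigint
# name      object
# gender    object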

You need to explicitly set the column type of id when calling pandas.read_csv:

df = pd.read_csv('./data/students.csv', dtype={'id': 'int32'})
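
Equivalently, you can downcast an existing frame before writing, and then read the schema back to confirm the fix; a minimal sketch reusing the paths from the question, with pyarrow assumed as the Parquet engine:

import pandas as pd
import pyarrow.parquet as pq

df = pd.read_csv('./data/students.csv')

# Downcast so Parquet stores INT32, matching the table's INTEGER column
df['id'] = df['id'].astype('int32')
df.to_parquet('students.parquet')

# Should now print a schema containing: id: int32
print(pq.read_schema('students.parquet'))

Alternatively, declaring the column as bigint in the external table definition would make it match the int64 that the file already contains.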