
I use AWS Athena to query some data stored in S3, namely partitioned Parquet files written with pyarrow.

I have three columns with string values, one column called "key" with int values, and one column called "result" that contains both double and int values.

With those columns, I created a schema like:

create external table (
    key int,
    result double,
    location string,
    vehicle_name string,
    filename string
)

When I queried the table, I got:

HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema

So I modified the schema, changing the result column's data type to INT.

Then I queried the table again and got:

HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema

I've looked around to try to understand why this might happen but found no solution.

Any suggestion is much appreciated.

– Sarathy Velmurugan

1 Answer


It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the table column as double, Athena will eventually read a file where the corresponding column is int and throw this error, and vice versa if you type the table column as int.
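
One way to confirm this is to look at each file's footer metadata. Something like the following sketch would show which files disagree (the bucket and prefix are placeholders, and it assumes pyarrow can reach S3 with your ambient credentials):

    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem()  # picks up credentials from the environment

    # Walk every object under the table's prefix (placeholder path)
    selector = fs.FileSelector("my-bucket/path/to/table", recursive=True)
    for info in s3.get_file_info(selector):
        if not info.path.endswith(".parquet"):
            continue
        # Reads only the schema from the file footer, not the data itself
        schema = pq.read_schema(info.path, filesystem=s3)
        print(info.path, schema.field("result").type)  # int64 for some files, double for others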

Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).

The solution is to make sure your files all have the same schema. You probably need to be explicit in the code that produces the files about what schema to produce (e.g. make it always use DOUBLE).
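
For example, since you mentioned pyarrow, a sketch of pinning the schema on the writer side could look like this (the DataFrame is just a stand-in for however you actually build your data, and the output file name is a placeholder):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in for your real data; note this batch's "result" values are all ints
    df = pd.DataFrame({
        "key": [1, 2],
        "result": [3, 4],
        "location": ["a", "b"],
        "vehicle_name": ["x", "y"],
        "filename": ["f1", "f2"],
    })

    # Declare the schema explicitly so every file comes out the same
    schema = pa.schema([
        ("key", pa.int32()),           # int32 matches Athena's INT
        ("result", pa.float64()),      # always DOUBLE, even for all-int batches
        ("location", pa.string()),
        ("vehicle_name", pa.string()),
        ("filename", pa.string()),
    ])

    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "part-00000.parquet")  # placeholder file name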

– Theo
    I wonder if there's a way to salvage data that has been saved in multiple formats like this. Maybe some way on the query side of things to make it more permissive or consider all versions of the Glue table before saying a file can't be read. Any information about how to still get use from data like this would be great. – trademark Feb 19 '21 at 17:29
  • @trademark you would have to read it file by file using Spark and cast it into the type you want. It's a bit of programming, but it's not that bad. You can use the S3 SDK to list objects, load them, and transform them one by one. – Hedrack Aug 19 '21 at 04:43
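
For reference, the file-by-file rewrite described in the comment above could look roughly like this with plain pyarrow instead of Spark. This is a sketch only: it assumes every file has the five columns from the question in the same order, reuses the same placeholder bucket/prefix as above, and overwrites each object in place (keep the answer's caveat in mind that very large int64 values may not survive the trip to double exactly):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem()

    # The schema every file should end up with (same as the writer-side sketch)
    target = pa.schema([
        ("key", pa.int32()),
        ("result", pa.float64()),
        ("location", pa.string()),
        ("vehicle_name", pa.string()),
        ("filename", pa.string()),
    ])

    selector = fs.FileSelector("my-bucket/path/to/table", recursive=True)
    for info in s3.get_file_info(selector):
        if not info.path.endswith(".parquet"):
            continue
        table = pq.read_table(info.path, filesystem=s3)
        # Cast each file's columns to the target types (int64 -> double here)
        table = table.cast(target)
        pq.write_table(table, info.path, filesystem=s3)  # overwrite in place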