
I am working on a Python application that converts CSV files to a Hive/Athena-compatible Parquet format, using the fastparquet and pandas libraries. The CSV file contains timestamp values such as 2018-12-21 23:45:00, which need to be written as the timestamp type in the Parquet file. Below is the code I am running:

import io

import boto3
import pandas as pd
import s3fs
from fastparquet import write

# bucketname, keyname and PARTITION_KEYS are defined elsewhere in the application
columnNames = ["contentid", "processed_time", "access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']

# read the gzipped CSV from S3, parsing the two timestamp columns as dates
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0, sep=',',
                 quotechar='"', names=columnNames, error_bad_lines=False,
                 dtype=dtypes, parse_dates=dateCols)

# write the dataframe as a hive-partitioned, snappy-compressed parquet dataset
s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,
      file_scheme='hive', partition_on=PARTITION_KEYS)

The code ran successfully; below are the dtypes of the dataframe created by pandas:

contentid                 object
processed_time            datetime64[ns]
access_time               datetime64[ns]

Finally, when I query the Parquet file in Hive and Athena, the timestamp value is +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00.

Any help is highly appreciated

prasannads
  • Try converting the columns to datetime format before inserting into Hive: pd.to_datetime(df[['access_time', 'processed_time']], unit='ms', errors='coerce') – theMerakist Dec 25 '18 at 06:33
  • Tried that as well, but it is still the same. – prasannads Dec 25 '18 at 06:44
  • Don't parse the columns while creating the DF; instead convert them to datetime objects with datetime.datetime.strptime('2018-12-21 23:45:00', '%Y-%m-%d %H:%M:%S') and apply that to the date columns of the df. – theMerakist Dec 25 '18 at 06:57
  • You might need to use a function from https://prestodb.io/docs/current/functions/datetime.html, depending on the format generated by your Python script, if it is not directly supported by Athena/Hive. Use "parquet-tools cat" to check the data and the schema. Post the timestamp format here if you are not able to find the correct function for the conversion. – Prabhakar Reddy Dec 29 '18 at 06:02
  • Did any of these answers help? I'm running into the exact same issue. – Severun May 24 '19 at 19:29

6 Answers

7

I know this question is old but it is still relevant.

As mentioned before, Athena only supports int96 timestamps. Using fastparquet, it is possible to generate a Parquet file in the correct format for Athena. The important part is times='int96', as this tells fastparquet to convert the pandas datetime columns to int96 timestamps.

from fastparquet import write
import pandas as pd

def write_parquet():
  df = pd.read_csv('some.csv')
  write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')
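
Applied to the write call from the question, this is roughly what it would look like (a sketch, assuming the same df, myopen and PARTITION_KEYS as in the question):

from fastparquet import write

# same call as in the question, with times='int96' added so the timestamps
# are stored in the encoding Athena/Hive expect
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,
      file_scheme='hive', partition_on=PARTITION_KEYS, times='int96')
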
2

I solved the problem this way:

transform the df series with the to_datetime method,

then, with the .dt accessor, pick the date part of the datetime64[ns].

Example:

df.field = pd.to_datetime(df.field)
df.field = df.field.dt.date

After that, Athena will recognize the data.
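
Applied to the two timestamp columns from the question, that looks roughly like the sketch below. Note that .dt.date keeps only the calendar date, so the 23:45:00 time-of-day portion is dropped; whether that is acceptable depends on your table schema.

import pandas as pd

# parse each column, then keep only the date part (the time of day is discarded)
for col in ['processed_time', 'access_time']:
    df[col] = pd.to_datetime(df[col]).dt.date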

2

You could try:

dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)
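
To check that the timestamps really were written as INT96, you can inspect the file's physical schema with pyarrow (a sketch; the path is assumed to be wherever the file was written):

import pyarrow.parquet as pq

# the timestamp columns should show a physical type of INT96 in the printed schema
pf = pq.ParquetFile('outfile.snappy.parquet')
print(pf.schema)
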
0

The problem seems to be with Athena: it only seems to support int96, and when you create a timestamp in pandas it is an int64.

My dataframe column that contains a string date is "sdate"; I first convert it to a timestamp.

import pandas
import pyarrow
import pyarrow.parquet
import s3fs

# add a new column with the timestamp parsed from the string column "sdate"
df["ndate"] = pandas.to_datetime(df["sdate"])
# convert the timestamp to microseconds
df["ndate"] = pandas.to_datetime(df["ndate"], unit='us')

# Then I convert my dataframe to pyarrow
table = pyarrow.Table.from_pandas(df, preserve_index=False)

# After that, when writing to parquet, add coerce_timestamps and
# use_deprecated_int96_timestamps. (Also writing to S3 directly;
# the 'date' partition column is assumed to exist in the dataframe.)
OUTBUCKET = "my_s3_bucket"
s3 = s3fs.S3FileSystem()

pyarrow.parquet.write_to_dataset(table, root_path='s3://{0}/logs'.format(OUTBUCKET),
                                 partition_cols=['date'], filesystem=s3,
                                 coerce_timestamps='us',
                                 use_deprecated_int96_timestamps=True)

Severun
0

I have also run into this problem several times. My mistake was setting the index to datetime format with:

df.set_index(pd.DatetimeIndex(df.index), inplace=True)

When I then read the Parquet file with fastparquet, I got:

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 219968-03-28 05:07:11

However, it can easily be solved by using pd.read_parquet(path_file) rather than fastparquet.ParquetFile(path_file).to_pandas().

Please use pd.read_parquet(path_file) to fix this problem.

That's my solution and it works well; hopefully it helps, so you don't need to worry about how the Parquet file was written.
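
For reference, the two read paths being compared (a sketch; the file path is a placeholder):

import fastparquet
import pandas as pd

path_file = 'outfile.snappy.parquet'  # placeholder path to the parquet file

# this is the call that raised OutOfBoundsDatetime for the file above
df = fastparquet.ParquetFile(path_file).to_pandas()

# reading through pandas worked fine
df = pd.read_parquet(path_file)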

Gusty2000
-1

I was facing the same problem; after a lot of research, it is solved now.

When you do

write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,file_scheme='hive',partition_on=PARTITION_KEYS)

it uses fastparquet behind the scenes, which uses a different encoding for datetimes than the one Athena is compatible with.

The solution is to uninstall fastparquet and install pyarrow:

  • pip uninstall fastparquet
  • pip install pyarrow

run your code again. It should work this time. :)
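
With pyarrow installed, one way to route the write through it is pandas' to_parquet with the pyarrow engine. This is only a sketch under assumptions not in the answer: a local output file, and df being the dataframe from the question; partitioning and the S3 upload would still have to be handled separately (e.g. with partition_cols and s3fs).

import pandas as pd

# df is the dataframe from the question; the local output path is an assumption
df.to_parquet('outfile.snappy.parquet',
              engine='pyarrow',
              compression='snappy',
              use_deprecated_int96_timestamps=True)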

Amandeep Singh