0

I'm trying to read a parquet file using boto3. The original file has dates with the following format:

2016-12-07 23:00:00.000

And they are stored as timestamps.

My code in Sage Maker is:

boto_s3 = boto3.client('s3') 
r = boto_s3.select_object_content(
        Bucket='bucket_name',
        Key='path/file.gz.parquet',
        ExpressionType='SQL',
        Expression=f"select fecha_instalacion,pais from s3object s ",
        InputSerialization = {'Parquet': {}},
        OutputSerialization = {'CSV': {}},
)


rl0 = list(r['Payload'])[0]

from io import StringIO
string_csv = rl0['Records']['Payload'].decode('ISO-8859-1')
csv = StringIO(string_csv)
pd.read_csv(csv, names=['fecha_instalacion',  'pais'])

But instead of the date I get:

fecha_instalacion  pais
45352962065516692798029824  ESPAƃA

I loooked for dates with only one day in between and the nyuumber of digits that are the same are the first 6. As an example:

45337153205849123712294912-->   2016-12-09 23:00:00.000
45337116312360976293191680-->   2016-12-07 23:00:00.000

I would need to get the correct formated date, and avoid the especial characters.

Thanks.

GabyLP
  • 3,649
  • 7
  • 45
  • 66

1 Answers1

2

The problem is the format. That Parquet file is using Int96 numbers to represent timestamp.

Here is a function to convert the int96Timestamp to python DateTime

import datetime

def dateFromInt96Timestamp(int96Timestamp):
    julianCalendarDays = int96Timestamp >> 8*8
    time = int((int96Timestamp & 0xFFFFFFFFFFFFFFFF) / 1_000)
    linuxEpoch = 2_440_588
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(days=julianCalendarDays - linuxEpoch, microseconds=time)  
Fede
  • 460
  • 3
  • 10