I'm trying to read a parquet file using boto3. The original file has dates with the following format:
2016-12-07 23:00:00.000
And they are stored as timestamps.
My code in Sage Maker is:
boto_s3 = boto3.client('s3')
r = boto_s3.select_object_content(
Bucket='bucket_name',
Key='path/file.gz.parquet',
ExpressionType='SQL',
Expression=f"select fecha_instalacion,pais from s3object s ",
InputSerialization = {'Parquet': {}},
OutputSerialization = {'CSV': {}},
)
rl0 = list(r['Payload'])[0]
from io import StringIO
string_csv = rl0['Records']['Payload'].decode('ISO-8859-1')
csv = StringIO(string_csv)
pd.read_csv(csv, names=['fecha_instalacion', 'pais'])
But instead of the date I get:
fecha_instalacion pais
45352962065516692798029824 ESPAĆA
I loooked for dates with only one day in between and the nyuumber of digits that are the same are the first 6. As an example:
45337153205849123712294912--> 2016-12-09 23:00:00.000
45337116312360976293191680--> 2016-12-07 23:00:00.000
I would need to get the correct formated date, and avoid the especial characters.
Thanks.