8

I'm getting this error when converting a pandas DataFrame to Parquet using pyarrow:

ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

To find out which column is the problem, I built a new DataFrame in a for loop, starting with the first column and adding one more column on each iteration. The error comes from a column of dtype object whose first values are 0s. I guess that is why pyarrow tries to convert the column to int, and it then fails because other values are UUIDs.

I'm trying to pass a schema (not sure if this is the way to go):

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

where schema is: df.dtypes

Andrii Omelchenko
Carlos P Ceballos

1 Answer

12

Carlos, have you tried converting the column to one of the pandas types that pyarrow supports? They are listed here: https://arrow.apache.org/docs/python/pandas.html

Can you post the output of df.dtypes?

If changing the pandas column type doesn't help, you can define a pyarrow schema and pass it in.

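import pyarrow as pa

# Declare the Arrow type of each column explicitly instead of relying on inference.
fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')),
]

my_schema = pa.schema(fields)

# sample_df is the pandas DataFrame being converted.
table = pa.Table.from_pandas(sample_df, schema=my_schema, preserve_index=False)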

More information here:

https://arrow.apache.org/docs/python/data.html
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
https://arrow.apache.org/docs/python/generated/pyarrow.schema.html

Alexander
  • Storing an ID as int (in current pandas) can pose problems when there are missing values (loss of information when converting to float with long IDs) or when the IDs become very long (20+ chars) – Maarten Fabré Mar 30 '18 at 13:29
  • Hi Alexander, I want to keep the column dtypes as object because of other operations and transformations that are done later on. I'm trying pa.schema; if that doesn't work, I will coerce the df dtypes. – Carlos P Ceballos Mar 30 '18 at 16:55
  • When writing Parquet, it does not actually apply the schema. I have None values in some columns and want to convert them to int64, but the column is converted to float on write, even though table.schema shows int64. – Nikhil Redij Sep 23 '20 at 12:44