
I have a problem with filetypes when converting a parquet file to a dataframe.

I do

bucket = 's3://some_bucket/test/usages'

import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()

read_pq = pq.ParquetDataset(bucket, filesystem=s3).read_pandas()

When I inspect read_pq, I get

pyarrow.Table
_COL_0: decimal(9, 0)
_COL_1: decimal(9, 0)
_COL_2: decimal(9, 0)
_COL_3: decimal(9, 0)

When I do df = read_pq.to_pandas(); df.dtypes, I get

_COL_0    object
_COL_1    object
_COL_2    object
_COL_3    object
dtype: object

The original data are all integers. When I operate on the objects in the pandas dataframe, the operations are very slow.

  • How can I convert the parquet columns to a format that will be read as an int or as a float in pandas?
  • Or is it best to operate on the pandas dataframe as above and use pd.to_numeric or similar?
  • Or is there an issue with the original dataformat decimal(9, 0)?

Or is it best to convert on the pandas dataframe directly?

I tried read_pq.column('_COL_0').cast('int32'), but it throws an error like

No cast implemented from decimal(9, 0) to int32
clog14

2 Answers


Pandas is funny about integers and such. It does have separate int and float dtypes, but from what I understand from reading the pandas documentation, an integer column that contains missing values is stored as float64 by default, so a lot of numeric data ends up as float anyway.

In this situation I would go ahead and use astype to start working with your data like this:

df['_COL_0'] = df['_COL_0'].astype(float)

If they are truly all integers then you should be able to use this simple for loop to cast all the pandas series (columns) to float values like so:

for col in df.columns:
    df[col] = df[col].astype(float)

Let me know if this works for you. I just ran a test in a Jupyter notebook and it seemed to work.
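As a minimal sketch of the astype approach, the snippet below builds a small frame by hand instead of reading from S3. It assumes that decimal(9, 0) columns arrive in pandas as Python Decimal objects (which is why the dtype shows as object); the values and the names df_float and df_int are made up for illustration.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical stand-in for the frame returned by to_pandas():
# the Decimal objects give every column the object dtype.
df = pd.DataFrame({
    "_COL_0": [Decimal("1"), Decimal("2"), Decimal("3")],
    "_COL_1": [Decimal("10"), Decimal("20"), Decimal("30")],
})
print(df.dtypes)

# Cast every column to float64 in one call instead of looping.
df_float = df.astype(float)

# With no missing values, the floats can then be converted to integers.
df_int = df_float.astype("int64")
print(df_int.dtypes)
```

The second cast only makes sense when the column has no missing values, since a plain int64 column cannot hold NaN.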

git_rekt

One common reason integer columns are converted to float is the presence of null or missing values. Pandas represents missing values with NaN, a special float value (np.nan). Because the default NumPy-backed integer dtypes cannot hold NaN, a column containing missing values is automatically converted to float.
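The effect is easy to see on a tiny series. This is only an illustration with made-up values: the first series uses the default dtype inference, the second asks for the nullable "Int64" extension dtype explicitly.

```python
import pandas as pd

# A missing value forces the default integer column to float64.
with_nan = pd.Series([1, 2, None])
print(with_nan.dtype)

# The nullable extension dtype "Int64" (capital I) keeps the integers
# and stores the missing entry as pd.NA instead of NaN.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
```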

Since pandas 1.2.0, the pandas.read_parquet function accepts an optional use_nullable_dtypes argument (deprecated in pandas 2.0 in favor of dtype_backend):

import pandas as pd

bucket = 's3://some_bucket/test/usages'

df = pd.read_parquet(bucket, use_nullable_dtypes=True)

Here is the official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

Tom Tang