
Context

I have partitioned Parquet files in S3. I want to read and concatenate them into a single in-memory DataFrame so I can query and view the data. This mostly works, but one column, whose type is array<array<double>>, comes back as None. Other columns (str, array of int, etc.) are converted correctly. I am not sure what I am missing. Either the data is lost during the conversion, or the data is there and my way of querying it is wrong.

Steps I did so far

import s3fs
import fastparquet as fp
import pandas as pd

key = 'MyAWSKey'
secret = 'MyAWSSecret'
token = 'MyAWSToken'

s3_file_system = s3fs.S3FileSystem(key=key, secret=secret, token=token)
file_names = s3_file_system.glob(path='s3://.../*.snappy.parquet')

# Build a single fastparquet.api.ParquetFile from the list of file paths
fp_api_parquetfile_obj = fp.ParquetFile(file_names, open_with=s3_file_system.open)

data = fp_api_parquetfile_obj.to_pandas()

Query Result

# column A type is array of array of doubles
print(pd.Series(data['A']).head(10))
# Prints 10 rows of None! [Incorrect]

# column B type is array of int
print(pd.Series(data['B']).head(10))
# Prints 10 rows of array of int values correctly

# column C type is string
print(pd.Series(data['C']).head(10))
# Prints 10 rows of str values correctly

Please note that the data (array of array of doubles) does exist in the files, because I can query it using Athena.

Mahshid Zeinaly

1 Answer


I could not find a way to get fastparquet to read the array-of-arrays column. Instead I used a different library (pyarrow), and it works!

import s3fs
import pandas as pd
import pyarrow.parquet as pq

key = 'MyAWSKey'
secret = 'MyAWSSecret'
token = 'MyAWSToken'

s3_file_system = s3fs.S3FileSystem(key=key, secret=secret, token=token)
file_names = s3_file_system.glob(path='s3://.../*.snappy.parquet')

# Read each file with pyarrow and convert it to a pandas DataFrame
data_frames = [pq.ParquetDataset('s3://' + f, filesystem=s3_file_system).read_pandas().to_pandas() for f in file_names]

# Stack the per-file DataFrames into one, renumbering the index
data = pd.concat(data_frames, ignore_index=True)

# column A type is array of array of doubles
print(pd.Series(data['A']).head(10))
# Prints 10 rows of array of arrays correctly
Mahshid Zeinaly