How to read in files with .snappy.parquet extension

Question

I have files with the .snappy.parquet extension that I need to read into my Jupyter notebook and convert to a pandas DataFrame.

import pyarrow.parquet as pq

# Read the Parquet file into an Arrow table, then convert it to pandas
filename = "part-00000-tid-2430471264870034304-5b82f32f-de64-40fb-86c0-fb7df2558985-1598426-1-c000.snappy.parquet"
df = pq.read_table(filename).to_pandas()

The error is:

ArrowNotImplementedError: lists with structs are not supported

score 4 · Answer 1 · answered Nov 30 '19 at 15:19

As of 2019-11-30, Apache Arrow does not support columns of type List[Struct[..]] (i.e. mixed nesting of lists and structs). As mentioned in a different answer, the related issue is https://issues.apache.org/jira/browse/ARROW-1644.

To read the file anyway, you can load only the columns that have supported types by passing the columns argument to pyarrow.parquet.read_table. To find out which columns have the complex nested types, inspect the file's schema with pyarrow.parquet.ParquetFile(filename).schema.
