
I have a parquet file with a simple schema and a few columns. I read it into Python using the code below:

from fastparquet import ParquetFile
pf = ParquetFile('inout_files.parquet')

This runs fine, but when I convert it into pandas using the code below I get the following error:

df = pf.to_pandas()

The error is:

 NotImplementedError: Encoding 4

To find the source of the error I ran df = pf.to_pandas(columns=col_to_retrieve), adding the columns one at a time, and noticed that the error is raised by a column whose cells each hold a list of strings (e.g. ("a","b","c")).

Do you know how to convert it to pandas, given that one of the columns has type set(string)?

Reyhaneh
  • Is it possible to use [pd.read_parquet](http://pandas.pydata.org/pandas-docs/stable/io.html#io-parquet)? – jezrael Jan 05 '18 at 13:59
  • Thanks @jezrael, but when the engine= ‘fastparquet’ I get the same error and with engine='pyarrow', I get the error below which I assume is related to the same column issue: pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483775 – Reyhaneh Jan 05 '18 at 14:55
  • I was worried about that :( No other ideas... – jezrael Jan 05 '18 at 14:56
  • 1
    I opened https://issues.apache.org/jira/browse/PARQUET-1186 about more gracefully handling column with very large binary data – Wes McKinney Jan 05 '18 at 16:07
  • Thank you for this – Reyhaneh Jan 05 '18 at 16:33

1 Answer


After re-reading the question I'm concerned my answer may be a non sequitur...

I am having a related problem with a very large dataframe/parquet file and getting the error "BinaryArray cannot contain more than 2147483646 bytes".

It appears that fastparquet can read my large table without errors and pyarrow can write it without issues, as long as I don't have category columns. So this is my current workaround until the issue is resolved:

0) Take the dataframe without category columns and make a table:

import pyarrow as pa    
table = pa.Table.from_pandas(df)
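Step 0 presupposes that the category columns are already gone; a minimal sketch of casting them back to plain object dtype first (the dataframe and its column names are made up for illustration):

```python
import pandas as pd

# A toy frame with one category column, standing in for the real dataframe.
df = pd.DataFrame({'kind': pd.Categorical(['a', 'b', 'a']), 'value': [1, 2, 3]})

# Cast every category column back to object so pyarrow never sees
# a dictionary-encoded column when building the table.
cat_cols = df.select_dtypes(include='category').columns
df[cat_cols] = df[cat_cols].astype(object)

print(df.dtypes.tolist())  # no 'category' dtype left
```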

1) write my tables using pyarrow.parquet:

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')

2) read my tables using fastparquet:

from fastparquet import ParquetFile 
pf = ParquetFile('example.parquet')

3) convert to pandas using fastparquet:

df = pf.to_pandas()

Ringworm