A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
6
votes
1 answer
Write nested parquet format from Python
I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any…

Stephan Claus
- 405
- 1
- 6
- 16
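One hedged sketch for the nested-write question above, assuming the JSON column is called "payload" and that writing with pyarrow (rather than fastparquet) is acceptable for nested output; the file path and column name are illustrative only:

import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_parquet("flat.parquet")              # hypothetical flat input file
df["payload"] = df["payload"].apply(json.loads)   # parse the JSON strings into dicts

# pyarrow infers a struct type for the dict column, producing a nested Parquet schema
table = pa.Table.from_pandas(df)
pq.write_table(table, "nested.parquet")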
6
votes
3 answers
Why does the index name always appear in the parquet file created with pandas?
I am trying to create a parquet file from a pandas dataframe, and even though I delete the index, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set as None.
>>> df =…

Jyoti Dhiman
- 540
- 2
- 6
- 17
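A minimal sketch for the index question above: pandas can drop the index entirely at write time, or the index name can be cleared before writing. File names are placeholders:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.index.name = "idx"

# index=False drops the index (and its name) from the written file entirely
df.to_parquet("no_index.parquet", index=False)

# alternatively, keep the index but clear its name before writing
df.index.name = None
df.to_parquet("unnamed_index.parquet")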
6
votes
3 answers
How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet
I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories.
For example:
import fastparquet
pfile =…

Tim Morton
- 240
- 1
- 3
- 11
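For the multi-directory question above, a sketch of one option: dask's read_parquet accepts an explicit list of paths, so files living in different directories can be combined into one dataframe as long as the schemas match. The paths below are hypothetical:

import dask.dataframe as dd

paths = ["dir_a/part.0.parquet", "dir_b/part.0.parquet"]  # hypothetical locations
ddf = dd.read_parquet(paths, engine="fastparquet")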
5
votes
2 answers
Reading Parquet File with Array
I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array

Jon.H
- 794
- 2
- 9
- 23
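A hedged sketch for the list-of-dictionaries question above: nested list/struct columns written by Spark are often easier to decode with the pyarrow engine than with fastparquet. The directory name is a placeholder:

import dask.dataframe as dd

# pyarrow handles list<struct<...>> columns that fastparquet may not decode
ddf = dd.read_parquet("spark_output/", engine="pyarrow")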
5
votes
1 answer
memory usage when indexing a large dask dataframe on a single multicore machine
I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G 16-core GCP instance.
CirrusSearch dumps come as a single JSON-lines formatted file.
The English Wikipedia dumps contain 5M records and…

Daniel Mahler
- 7,653
- 5
- 51
- 90
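A rough sketch of the pipeline described above, with hypothetical file names; reading the JSON-lines dump in blocks and then calling set_index is where the memory pressure typically appears:

import dask.dataframe as dd

ddf = dd.read_json("cirrussearch-dump.json", lines=True, blocksize=2**28)  # hypothetical dump file
ddf = ddf.set_index("title")   # triggers a full shuffle; this is the memory-heavy step
ddf.to_parquet("wiki_parquet/", engine="fastparquet")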
5
votes
1 answer
How to write a Dask dataframe containing a column of arrays to a parquet file
I have a Dask dataframe, one column of which contains a numpy array of floats:
import dask.dataframe as dd
import pandas as pd
import numpy as np
df = dd.from_pandas(
    pd.DataFrame(
        {
            'id': range(1, 6),
            …

junichiro
- 5,282
- 3
- 18
- 26
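A self-contained sketch related to the array-column question above, assuming the pyarrow engine is an option; it writes each numpy array as a Parquet list column. The column names mirror the truncated snippet but are otherwise illustrative:

import dask.dataframe as dd
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "id": range(1, 6),
    "vec": [np.random.rand(3) for _ in range(5)],   # column of float arrays
})
ddf = dd.from_pandas(pdf, npartitions=1)

# the pyarrow engine stores each array as a list<double> value in the Parquet file
ddf.to_parquet("arrays_parquet/", engine="pyarrow")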
4
votes
1 answer
Can I access a Parquet file via index without reading the entire file into memory?
I just read that HDF5 allows you to seek into data without reading the entire file into memory.
Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype…

Kermit
- 4,922
- 4
- 42
- 74
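For the random-access question above, a hedged sketch with fastparquet: opening a ParquetFile only reads the footer metadata, and row groups (optionally restricted to a few columns) can then be loaded one at a time. The file name, column names, and process() handler are hypothetical:

import fastparquet

pf = fastparquet.ParquetFile("big.parquet")   # reads footer metadata, not the data pages

for chunk in pf.iter_row_groups(columns=["id", "value"]):
    process(chunk)   # hypothetical per-chunk handler; each chunk is a pandas DataFrame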
4
votes
1 answer
Read/Write Parquet with Struct column type
I am trying to write a Dataframe like this to Parquet:
| foo | bar |
|-----|-------------------|
| 1 | {"a": 1, "b": 10} |
| 2 | {"a": 2, "b": 20} |
| 3 | {"a": 3, "b": 30} |
I am doing it with Pandas and Fastparquet:
df =…

Dario Chi
- 43
- 1
- 1
- 6
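A minimal sketch for the struct-column question above, assuming pyarrow is acceptable as the writer; file name is a placeholder:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}],
})

# pyarrow maps the dict column to a Parquet struct<a: int64, b: int64>
table = pa.Table.from_pandas(df)
pq.write_table(table, "struct_column.parquet")

# reading back with pyarrow returns the nested values as Python dicts
round_tripped = pq.read_table("struct_column.parquet").to_pandas()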
4
votes
0 answers
Unable to read parquet file with fastparquet but works with pyarrow - nullable ints
Currently running some code like this:
df = pd.read_parquet('/tmp/my-file.parquet', engine='pyarrow')
I was having memory consumption issues since the files are large, so I wanted to investigate whether fastparquet would work better for memory…

JD D
- 7,398
- 2
- 34
- 53
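Not a fix for the nullable-int issue above, but a hedged note on the memory side of the question: both engines let pandas read a subset of columns, which keeps peak memory down while comparing them. Column names are hypothetical:

import pandas as pd

df = pd.read_parquet(
    "/tmp/my-file.parquet",
    engine="fastparquet",
    columns=["col_a", "col_b"],   # hypothetical column names
)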
4
votes
2 answers
Python Pandas to convert CSV to Parquet using Fastparquet
I am using the Python 3.6 interpreter in my PyCharm venv and trying to convert a CSV to Parquet.
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')
Error-1
ImportError: Unable to find a usable engine;…

Himalay Majumdar
- 3,883
- 14
- 65
- 94
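A sketch of the usual fix for the ImportError above: the engines are separate packages, so fastparquet (or pyarrow) must be installed in the PyCharm venv, after which the engine can be named explicitly:

import pandas as pd

# requires `pip install fastparquet` in the active venv
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet', engine='fastparquet')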
4
votes
3 answers
dask read_parquet with pyarrow memory blow up
I am using dask to write and read parquet. I am writing using the fastparquet engine and reading using the pyarrow engine.
My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the…

pranav kohli
- 123
- 2
- 6
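A hedged sketch of one mitigation for the memory blow-up described above: depending on the dask version, read_parquet can split partitions on Parquet row groups so each pyarrow read stays small. The path and the availability of the flag are assumptions:

import dask.dataframe as dd

ddf = dd.read_parquet(
    "data/*.parquet",
    engine="pyarrow",
    split_row_groups=True,   # one partition per row group (parameter name varies by dask version)
)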
4
votes
0 answers
Optimal approach to create dask dataframe from parquet files (HDFS) in different directories
I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time.
Approach 1: call api read_parquet with glob path.…

Santosh Kumar
- 761
- 5
- 28
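For the HDFS question above, a sketch under the assumption that listing the files up front is feasible: passing an explicit list of paths avoids a recursive glob across many directories, and skipping per-file statistics gathering (the parameter name differs between dask versions) can also cut setup time. Paths are hypothetical:

import dask.dataframe as dd

paths = [
    "hdfs:///data/2021/01/part.parquet",
    "hdfs:///data/2021/02/part.parquet",
]
ddf = dd.read_parquet(paths, engine="fastparquet")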
3
votes
0 answers
Why does pandas.to_parquet need so much RAM?
Wisdom of such a use case aside... how come a machine with 512 GB RAM (and nothing else running) runs out of memory while trying to save a pandas df (df.to_parquet(...)) whose object size (sys.getsizeof) is "only" ~25 GB? The df is ~73mm rows by 2…

Tim
- 236
- 2
- 8
3
votes
0 answers
unable to export dataframe due to overflow error: Python int too large to convert to C long
I am trying to export a pandas dataframe as a parquet file. This dataframe has a memory usage of 4GB+ with 76 million rows and 6 columns (int64(3) columns, object(3) columns).
When I write this out as a parquet file, I am getting an OverflowError:…

veg2020
- 956
- 10
- 27
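A hedged sketch of one workaround for the OverflowError above, assuming the offending object columns hold Python ints larger than int64 can represent; the toy dataframe stands in for the real 76-million-row one:

import pandas as pd

df = pd.DataFrame({"big": [2**70, 2**71], "name": ["a", "b"]})   # toy stand-in

# values above 2**63 - 1 cannot be stored in an int64 column; writing them as
# strings (or otherwise downcasting) is one way to avoid the OverflowError
df["big"] = df["big"].astype(str)
df.to_parquet("out.parquet", engine="fastparquet")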
3
votes
1 answer
How to convert parquet to json
I have parquet files hosted on S3 that I want to download and convert to JSON. I was able to use select_object_content to output certain files as JSON using SQL in the past. I need to find a faster way to do it because it is timing out for larger…

bnykpp
- 55
- 1
- 5
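A sketch of a non-select_object_content route for the parquet-to-JSON question above, assuming s3fs is installed so pandas can read the S3 path directly; the bucket and key are placeholders:

import pandas as pd

df = pd.read_parquet("s3://my-bucket/data.parquet", engine="fastparquet")
df.to_json("data.json", orient="records", lines=True)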