A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
3 votes · 0 answers
Failed to install pyarrow and fastparquet
Good afternoon everyone; first of all, I am new to Python, so please bear with me.
I am trying to read and manipulate a .parquet file, so I looked up on the internet what I should do and found that I should use pyarrow or fastparquet.
So I tried…

Beatriz Campos · 31
3 votes · 0 answers
Pandas Read/Write Parquet Data using Column Index
Is it possible to use pandas to selectively read rows from Parquet files using its column index?
Similarly, when writing a Pandas DataFrame to a Parquet file, such as using pd.DataFrame.to_parquet(), is it possible to specify the DataFrame column or…

Athena Wisdom · 6,101
3 votes · 2 answers
Why is computing the shape on an indexed Parquet file so slow in dask?
I have created a Parquet file from multiple Parquet files located in the same folder. Each file corresponds to a partition.
Parquet files are created in different processes (using Python concurrent.futures). Here is an example of the code I run in…

hadim · 636
3 votes · 1 answer
Dask - How to cancel and resubmit stalled tasks?
Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…

dan · 183
3 votes · 1 answer
Pandas and FastParquet read a single partition
I have a miserably long-running job to read in a dataset that has a natural, logical partition on US state. I have saved it as a partitioned parquet dataset from pandas using fastparquet (via pd.DataFrame.to_parquet).
I want my buddy to be able to read…

user3502355 · 147
3 votes · 1 answer
Segmentation Fault while reading parquet file from AWS S3 using read_parquet in Python Pandas
I have a Python script running on an AWS EC2 instance (on AWS Linux), and the script pulls a parquet file from S3 into a Pandas dataframe. I'm now migrating to a new AWS account and setting up a new EC2 instance. This time, when executing the same script on python…

Niv Cohen · 1,078
3 votes · 1 answer
Dask - Quickest way to get row length of each partition in a Dask dataframe
I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way?
Here's a simplified snippet of…

dan · 183
3 votes · 1 answer
How to read a single parquet file from s3 into a dask dataframe?
I'm trying to read a single parquet file with snappy compression from S3 into a Dask DataFrame. There is no metadata directory, since this file was written using Spark 2.1.
It does not work locally with fastparquet:
import dask.dataframe as…

arinarmo · 375
3 votes · 1 answer
Generating parquet files - differences between R and Python
We have generated a parquet file in Dask (Python) and with Drill (in R, using the sergeant package). We have noticed a few issues:
The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R…

skibee · 1,279
2 votes · 1 answer
Querying last row of sorted column where value is less than specific amount from parquet file
I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.
    X   Y
0   1   Red
1   5   Blue
2   8   Green
3  12   Purple
4  15   Blue
5  17   Purple
I am interested in querying the last value…
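Since X is sorted, a binary search avoids scanning the column. A sketch over the toy frame from the excerpt, using pandas' searchsorted (once the relevant rows are loaded):

```python
import pandas as pd

# The toy frame from the question: X is sorted ascending.
df = pd.DataFrame({
    "X": [1, 5, 8, 12, 15, 17],
    "Y": ["Red", "Blue", "Green", "Purple", "Blue", "Purple"],
})

def last_below(frame, threshold):
    """Return Y of the last row whose X is strictly below threshold."""
    # Binary search for the insertion point of threshold (no full scan).
    pos = frame["X"].searchsorted(threshold, side="left")
    return frame["Y"].iloc[pos - 1] if pos > 0 else None
```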

jd0 · 23
2 votes · 1 answer
AttributeError: 'ParquetFile' object has no attribute 'row_groups'
Pythonistas!
Not sure what I am doing wrong while reading a parquet file here.
I have all the necessary packages installed: pandas, fastparquet & pyarrow.
The code is literally just reading the parquet file:
import pandas as pd
FILE =…

ppatel26 · 183
2 votes · 0 answers
Neither pyarrow nor fastparquet able to read vector output from spark correctly
I have a dataframe written from spark in parquet format which has a column of type 'vector' in it. Printing the schema in spark gives the following
DataFrame[key: string, embedding: vector]
I have tried the following two approaches in python…

Gaurav Manchanda · 56
2 votes · 0 answers
Cannot use index on Dask dataframe stored in parquet file with Append=True
I have a use case where I want to store multiple Dask dataframes into a common parquet storage through to_parquet(ddf, 'TestParquet', append=True).
The structure of the parquet files is set through the first dataframe being written to it (without…

Matthieu Rosset · 31
2 votes · 1 answer
Reading large number of parquet files: read_parquet vs from_delayed
I'm reading a large number (100s to 1000s) of parquet files into a single dask dataframe (single machine, all local). I realized that
files = ['file1.parq', 'file2.parq', ...]
ddf = dd.read_parquet(files,…

mcsoini · 6,280
2 votes · 0 answers
Unable to import fastparquet in jupyter notebooks
Similar question to this one, but my error reports
ModuleNotFoundError: No module named 'fastparquet'
When I run conda list under the same virtual environment, I get
However, I'm able to import fastparquet when I'm in the interactive dialogue in…
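The usual cause of this symptom is that the notebook kernel runs a different interpreter than the shell where `conda list` was checked. A small stdlib-only check to compare the two:

```python
import sys

# The interpreter backing this kernel, and the environment it belongs to.
print(sys.executable)
print(sys.prefix)

# Installing via the kernel's own interpreter puts the package where the
# notebook can see it; in a notebook cell that would be:
#   !{sys.executable} -m pip install fastparquet
```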

Max Wong · 694