A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
2 votes · 1 answer
Efficiently reading only some columns from parquet file on blob storage using dask
How can I efficiently read only some of the columns of a parquet file that is hosted in cloud blob storage (e.g. S3 / Azure Blob Storage)?
The columnar structure is one of the parquet file format's key advantages, so that reading columns…

stav · 1,497 · 2 · 15 · 40
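A minimal sketch of the kind of column-pruned read this question asks about, assuming dask with s3fs (or adlfs for Azure) is installed; the bucket, path, column names, and credentials below are placeholders, not from the question:

import dask.dataframe as dd

# only the listed columns are fetched from object storage; other column chunks
# are never downloaded
df = dd.read_parquet(
    "s3://my-bucket/data/*.parquet",      # or "abfs://container/path" on Azure
    columns=["user_id", "amount"],
    storage_options={"anon": False},      # passed through to s3fs/adlfs
)
result = df.compute()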
2 votes · 1 answer
Writing a Parquet file from multiple Python processes using Dask
Is it possible to write to the same Parquet folder from different processes in Python?
I use fastparquet.
It seems to work, but I'm wondering how it is possible for the _metadata file to not have conflicts in case two processes write to it at the same…

hadim · 636 · 1 · 7 · 16
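One pattern that sidesteps concurrent writes to _metadata, sketched on the assumption that fastparquet's writer.merge helper is available; the file names and folder layout are illustrative only:

import glob
import fastparquet

def write_part(df, process_id):
    # each process writes its own independent part file inside the dataset folder
    fastparquet.write(f"dataset/part.{process_id}.parquet", df)

# after all processes have finished, a single process consolidates the footers,
# so _metadata is only ever written once
parts = sorted(glob.glob("dataset/part.*.parquet"))
fastparquet.writer.merge(parts)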
2 votes · 0 answers
pandas.read_parquet returns 'IsADirectoryError' in Azure databricks notebook
When I execute pd.read_parquet("/dbfs/XX/XX/agg.parquet") to access a parquet file called agg in Databricks' DBFS, it returns 'IsADirectoryError'. Although the file is shown as a folder when I use dbutils to list it, I think Spark can just read it…

zzzk · 135 · 10
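A sketch of two common workarounds, assuming the Spark-written agg.parquet is really a folder of part files; the part-file glob pattern is a guess at Spark's usual naming:

import glob
import pandas as pd

# option 1: the pyarrow engine treats the folder as a single dataset
df = pd.read_parquet("/dbfs/XX/XX/agg.parquet", engine="pyarrow")

# option 2: read the individual part files and concatenate them
parts = glob.glob("/dbfs/XX/XX/agg.parquet/part-*.parquet")
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)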
2 votes · 1 answer
Is it possible to read parquet metadata from Dask?
I have thousands of parquet files that I need to process. Before processing the files, I'm trying to get various information about the files using the parquet metadata, such as number of rows in each partition, mins, maxs, etc.
I tried reading…

dan · 183 · 13
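A sketch of reading just the footers with pyarrow (file names are placeholders): the metadata object exposes row counts and per-column min/max statistics without loading any row data:

import pyarrow.parquet as pq

for path in ["part-0.parquet", "part-1.parquet"]:
    meta = pq.ParquetFile(path).metadata
    print(path, "rows:", meta.num_rows, "row groups:", meta.num_row_groups)
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(0).statistics   # first column's stats
        if stats is not None:
            print("  row group", rg, "min:", stats.min, "max:", stats.max)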
2 votes · 2 answers
python dask to_parquet taking a lot of memory
I am using Python 3 with Dask to read a list of parquet files, do some processing, and then put it all into a new, unified parquet file for later use.
The process uses so much memory that it seems it tries to read all the parquet files into memory…

thebeancounter · 4,261 · 8 · 61 · 109
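A sketch of one mitigation, assuming the input is a list of parquet paths: keeping partitions at row-group size and writing without calling .compute() lets dask stream the data instead of holding every file in memory. The paths, column names, and processing step are placeholders:

import dask.dataframe as dd

file_list = ["data/part-0.parquet", "data/part-1.parquet"]   # placeholder paths

df = dd.read_parquet(file_list, split_row_groups=True)   # one partition per row group
df = df.assign(total=df["price"] * df["qty"])             # placeholder processing
df.to_parquet("combined/", write_metadata_file=True)      # written partition by partition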
2 votes · 2 answers
Reading parquet file from AWS S3 using pandas
I am trying to read a parquet file from AWS S3.
The same code works on my Windows machine.
A Google search produced no results.
Pandas should use fastparquet to build the DataFrame; fastparquet is installed.
Code:
import boto3
import pandas as…

balderman · 22,927 · 7 · 34 · 52
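A minimal sketch of a direct read (bucket, key, and credential values are placeholders): with s3fs installed, pandas can open the s3:// URL itself, so boto3 is not needed to download the object first:

import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/path/file.parquet",
    engine="fastparquet",
    storage_options={"key": "<access key id>", "secret": "<secret access key>"},
)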
2 votes · 0 answers
Save many large Pandas DataFrames into Single Parquet File without loading into memory
I want to try to save many large Pandas DataFrames, which will not fit into memory at once, into a single Parquet file. We would like to have a single big parquet file on disk in order to quickly grab the columns we need from that single big…

Nick Fernandez · 1,160 · 1 · 10 · 24
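A sketch of one approach using fastparquet's append mode (the input file and chunk size are placeholders): each chunk becomes a new row group in the same file, so only one chunk is in memory at a time:

import pandas as pd
import fastparquet

first = True
for chunk in pd.read_csv("huge_input.csv", chunksize=1_000_000):
    fastparquet.write("big.parquet", chunk, append=not first)   # add a row group
    first = False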
1 vote · 1 answer
pyarrow timestamp datatype error on parquet file
I get this error when I read and count records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can keep it as timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please…

Bill · 363 · 3 · 14
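A sketch of one option, assuming pandas 2.0 or newer alongside pyarrow 11 (the file name is a placeholder): mapping columns to pandas' Arrow-backed dtypes keeps timestamp[us] instead of coercing it to datetime64[ns]:

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")
df = table.to_pandas(types_mapper=pd.ArrowDtype)   # timestamps stay timestamp[us]
print(df.dtypes)
print(len(df))                                     # count records without converting the type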
1 vote · 2 answers
how to efficiently read pq files - Python
I have a list of files with the .pq extension, whose names are stored in a list. My intention is to read these files, filter them with pandas, and then merge them into a single pandas DataFrame.
Since there are thousands of files, the code…

sergey_208 · 614 · 3 · 21
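A sketch of the loop with the work pushed into the reader (paths, columns, and the filter are placeholders; row-group filters assume the pyarrow engine): selecting columns and filtering at read time keeps each per-file frame small before a single final concat:

import glob
import pandas as pd

paths = glob.glob("data/*.pq")
frames = [
    pd.read_parquet(p, engine="pyarrow",
                    columns=["id", "value"],
                    filters=[("value", ">", 0)])
    for p in paths
]
df = pd.concat(frames, ignore_index=True)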
1 vote · 1 answer
How to store and load multi-column index pandas dataframes with parquet
I have a dataset similar to:
initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)
I am able to store it…

KerikoN · 26 · 4
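A sketch of the round trip using the question's own initial_df (the output file name is a placeholder): pandas writes the index metadata into the parquet file, so the two-level index comes back as the same MultiIndex on read:

import pandas as pd

initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88},
                           {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)

initial_df.to_parquet("indexed.parquet")        # index levels are stored alongside 'c'
restored = pd.read_parquet("indexed.parquet")   # the ['a', 'b'] MultiIndex is rebuilt
assert list(restored.index.names) == ['a', 'b']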
1 vote · 1 answer
Can a parquet file exceed 2.1GB?
I'm having an issue storing a large dataset (around 40GB) in a single parquet file.
I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a…

Alex Pilafian · 121 · 1 · 5
1 vote · 1 answer
What is the least memory-intensive way to read a Parquet file in Python? Is line-by-line possible?
I'm writing a lambda to read records stored in Parquet files, restructure them into a partition_key: {json_record} format, and submit the records to a Kafka queue. I'm wondering if there's any way to do this without reading the entire table into…

James Kelleher · 1,957 · 3 · 18 · 34
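A sketch of a streaming read with pyarrow (the file name, key field, and Kafka producer are placeholders): iter_batches yields small record batches, so only one batch is ever held in memory rather than the whole table:

import pyarrow.parquet as pq

pf = pq.ParquetFile("records.parquet")
for batch in pf.iter_batches(batch_size=500):
    for record in batch.to_pylist():            # plain dicts, one per row
        partition_key = record["id"]            # hypothetical partition key field
        # producer.send("topic", key=partition_key, value=record)  # hypothetical Kafka call
        print(partition_key, record)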
1 vote · 0 answers
No such file or directory : fastparquet.llibs\\.load_order
I am trying to convert a CSV file into Parquet format using Dask.
The code I'm using is:
import dask.dataframe as dd
name_function = lambda x: f"tablename.parquet"
df = dd.read_csv('tablename.csv')
df.to_parquet('Data\\',…

linux · 157 · 11
1 vote · 1 answer
fastparquet error when saving pandas df to parquet: AttributeError: module 'fastparquet.parquet_thrift' has no attribute 'SchemaElement
import pandas as pd
from flatten_json import flatten
actual_column_list = ["_id", "external_id", "email", "created_at","updated_at", "dob.timestamp", "dob_1.timestamp","column_10"]
data = [{'_id': '60efe3333333445', 'external_id': 'ID2', 'dob':…

Dulshan · 31 · 7
1 vote · 0 answers
AttributeError: 'numpy.ndarray' object has no attribute '_ndarray' while reading parquet files in pandas
I am trying to read parquet files using pandas with the fastparquet engine, as shown below, and I am getting the error below.
import pandas as pd
df = pd.read_parquet("path/to/parquet_file/", engine='fastparquet')
Error:
File…

Vineel · 35 · 6