A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
2 votes · 1 answer
Efficiently reading only some columns from parquet file on blob storage using dask
How can I efficiently read only some of the columns of a parquet file that is hosted in cloud blob storage (e.g. S3 / Azure Blob Storage)?
The columnar structure is one of the parquet file format's key advantages, so that reading columns…

stav · 1,497 · 2 · 15 · 40
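A minimal sketch of the kind of column-pruned read this question asks about, assuming dask with s3fs (or adlfs for Azure) is installed; the bucket, path, column names, and credentials below are placeholders, not from the question:

import dask.dataframe as dd

# only the listed columns are fetched from object storage; other column chunks
# are never downloaded
df = dd.read_parquet(
    "s3://my-bucket/data/*.parquet",      # or "abfs://container/path" on Azure
    columns=["user_id", "amount"],
    storage_options={"anon": False},      # passed through to s3fs/adlfs
)
result = df.compute()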
2 votes · 1 answer
Writing a Parquet file from multiple Python processes using Dask
Is it possible to write to the same Parquet folder from different processes in Python?
I use fastparquet.
It seems to work, but I'm wondering how it is possible for the _metadata file to not have conflicts in case two processes write to it at the same…

hadim · 636 · 1 · 7 · 16
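One pattern that sidesteps concurrent writes to _metadata, sketched on the assumption that fastparquet's writer.merge helper is available; the file names and folder layout are illustrative only:

import glob
import fastparquet

def write_part(df, process_id):
    # each process writes its own independent part file inside the dataset folder
    fastparquet.write(f"dataset/part.{process_id}.parquet", df)

# after all processes have finished, a single process consolidates the footers,
# so _metadata is only ever written once
parts = sorted(glob.glob("dataset/part.*.parquet"))
fastparquet.writer.merge(parts)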
2 votes · 0 answers
pandas.read_parquet returns 'IsADirectoryError' in Azure databricks notebook
When I execute pd.read_parquet("/dbfs/XX/XX/agg.parquet") to access a parquet file called agg in Databricks' DBFS, it returns 'IsADirectoryError'. Although the file is shown as a folder when I use dbutils to list it, I think Spark can just read it…

zzzk · 135 · 10
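A sketch of two common workarounds, assuming the Spark-written agg.parquet is really a folder of part files; the part-file glob pattern is a guess at Spark's usual naming:

import glob
import pandas as pd

# option 1: the pyarrow engine treats the folder as a single dataset
df = pd.read_parquet("/dbfs/XX/XX/agg.parquet", engine="pyarrow")

# option 2: read the individual part files and concatenate them
parts = glob.glob("/dbfs/XX/XX/agg.parquet/part-*.parquet")
df = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)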
2 votes · 1 answer
Is it possible to read parquet metadata from Dask?
I have thousands of parquet files that I need to process. Before processing the files, I'm trying to get various information about the files using the parquet metadata, such as number of rows in each partition, mins, maxs, etc.
I tried reading…

dan · 183 · 13
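A sketch of reading just the footers with pyarrow (file names are placeholders): the metadata object exposes row counts and per-column min/max statistics without loading any row data:

import pyarrow.parquet as pq

for path in ["part-0.parquet", "part-1.parquet"]:
    meta = pq.ParquetFile(path).metadata
    print(path, "rows:", meta.num_rows, "row groups:", meta.num_row_groups)
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(0).statistics   # first column's stats
        if stats is not None:
            print("  row group", rg, "min:", stats.min, "max:", stats.max)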
2 votes · 2 answers
python dask to_parquet taking a lot of memory
I am using Python 3 with Dask to read a list of parquet files, do some processing, and then put it all into a new, unified parquet file for later use.
The process uses so much memory that it seems it tries to read all the parquet files into memory…

thebeancounter · 4,261 · 8 · 61 · 109
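A sketch of one mitigation, assuming the input is a list of parquet paths: keeping partitions at row-group size and writing without calling .compute() lets dask stream the data instead of holding every file in memory. The paths, column names, and processing step are placeholders:

import dask.dataframe as dd

file_list = ["data/part-0.parquet", "data/part-1.parquet"]   # placeholder paths

df = dd.read_parquet(file_list, split_row_groups=True)   # one partition per row group
df = df.assign(total=df["price"] * df["qty"])             # placeholder processing
df.to_parquet("combined/", write_metadata_file=True)      # written partition by partition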
2 votes · 2 answers
Reading parquet file from AWS S3 using pandas
I am trying to read a parquet file from AWS S3.
The same code works on my Windows machine.
A Google search produced no results.
Pandas should use fastparquet to build the DataFrame; fastparquet is installed.
Code:
import boto3
import pandas as…

balderman · 22,927 · 7 · 34 · 52
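A minimal sketch of a direct read (bucket, key, and credential values are placeholders): with s3fs installed, pandas can open the s3:// URL itself, so boto3 is not needed to download the object first:

import pandas as pd

df = pd.read_parquet(
    "s3://my-bucket/path/file.parquet",
    engine="fastparquet",
    storage_options={"key": "<access key id>", "secret": "<secret access key>"},
)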
2 votes · 0 answers
Save many large Pandas DataFrames into Single Parquet File without loading into memory
I want to try to save many large Pandas DataFrames, which will not fit into memory at once, into a single Parquet file. We would like to have a single big parquet file on disk in order to quickly grab the columns we need from that single big…

Nick Fernandez · 1,160 · 1 · 10 · 24
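A sketch of one approach using fastparquet's append mode (the input file and chunk size are placeholders): each chunk becomes a new row group in the same file, so only one chunk is in memory at a time:

import pandas as pd
import fastparquet

first = True
for chunk in pd.read_csv("huge_input.csv", chunksize=1_000_000):
    fastparquet.write("big.parquet", chunk, append=not first)   # add a row group
    first = False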
1 vote · 1 answer
pyarrow timestamp datatype error on parquet file
I get this error when I read and count records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can keep it as timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please…

Bill · 363 · 3 · 14
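A sketch of one option, assuming pandas 2.0 or newer alongside pyarrow 11 (the file name is a placeholder): mapping columns to pandas' Arrow-backed dtypes keeps timestamp[us] instead of coercing it to datetime64[ns]:

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")
df = table.to_pandas(types_mapper=pd.ArrowDtype)   # timestamps stay timestamp[us]
print(df.dtypes)
print(len(df))                                     # count records without converting the type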
1 vote · 2 answers
how to efficiently read pq files - Python
I have a list of files with the .pq extension, whose names are stored in a list. My intention is to read these files, filter them with pandas, and then merge them into a single pandas DataFrame.
Since there are thousands of files, the code…

sergey_208 · 614 · 3 · 21
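A sketch of the loop with the work pushed into the reader (paths, columns, and the filter are placeholders; row-group filters assume the pyarrow engine): selecting columns and filtering at read time keeps each per-file frame small before a single final concat:

import glob
import pandas as pd

paths = glob.glob("data/*.pq")
frames = [
    pd.read_parquet(p, engine="pyarrow",
                    columns=["id", "value"],
                    filters=[("value", ">", 0)])
    for p in paths
]
df = pd.concat(frames, ignore_index=True)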
1 vote · 1 answer
How to store and load multi-column index pandas dataframes with parquet
I have a dataset similar to:
initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)
I am able to store it…

KerikoN · 26 · 4
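A sketch of the round trip using the question's own initial_df (the output file name is a placeholder): pandas writes the index metadata into the parquet file, so the two-level index comes back as the same MultiIndex on read:

import pandas as pd

initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88},
                           {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}])
initial_df.set_index(['a', 'b'], inplace=True)

initial_df.to_parquet("indexed.parquet")        # index levels are stored alongside 'c'
restored = pd.read_parquet("indexed.parquet")   # the ['a', 'b'] MultiIndex is rebuilt
assert list(restored.index.names) == ['a', 'b']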
1 vote · 1 answer
Can a parquet file exceed 2.1GB?
I'm having an issue storing a large dataset (around 40GB) in a single parquet file.
I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a…

Alex Pilafian · 121 · 1 · 5
1 vote · 1 answer
What is the least memory-intensive way to read a Parquet file in Python? Is line-by-line possible?
I'm writing a lambda to read records stored in Parquet files, restructure them into a partition_key: {json_record} format, and submit the records to a Kafka queue. I'm wondering if there's any way to do this without reading the entire table into…

James Kelleher · 1,957 · 3 · 18 · 34
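A sketch of a streaming read with pyarrow (the file name, key field, and Kafka producer are placeholders): iter_batches yields small record batches, so only one batch is ever held in memory rather than the whole table:

import pyarrow.parquet as pq

pf = pq.ParquetFile("records.parquet")
for batch in pf.iter_batches(batch_size=500):
    for record in batch.to_pylist():            # plain dicts, one per row
        partition_key = record["id"]            # hypothetical partition key field
        # producer.send("topic", key=partition_key, value=record)  # hypothetical Kafka call
        print(partition_key, record)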
1 vote · 0 answers
No such file or directory : fastparquet.llibs\\.load_order
I am trying to convert a CSV file into Parquet format using Dask.
The code I'm using is:
import dask.dataframe as dd
name_function = lambda x: f"tablename.parquet"
df = dd.read_csv('tablename.csv')
df.to_parquet('Data\\',…

linux · 157 · 11
1 vote · 1 answer
fastparquet error when saving pandas df to parquet: AttributeError: module 'fastparquet.parquet_thrift' has no attribute 'SchemaElement
import pandas as pd
from flatten_json import flatten
actual_column_list = ["_id", "external_id", "email", "created_at","updated_at", "dob.timestamp", "dob_1.timestamp","column_10"]
data = [{'_id': '60efe3333333445', 'external_id': 'ID2', 'dob':…

Dulshan · 31 · 7
1 vote · 0 answers
AttributeError: 'numpy.ndarray' object has no attribute '_ndarray' while reading parquet files in pandas
I am trying to read parquet files using pandas with the fastparquet engine, as shown below, and I am getting the error below.
import pandas as pd
df = pd.read_parquet("path/to/parquet_file/", engine='fastparquet')
Error:
File…

Vineel · 35 · 6