Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
3 votes • 0 answers

Failed to install pyarrow and fastparquet

Good afternoon everyone, first of all I am new to Python, so please bear with me. I am trying to read and manipulate a .parquet file, so I looked up on the internet what I should do and found that I should use pyarrow or fastparquet. So I tried…
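
A minimal sketch of the usual route, assuming a standard pip or conda environment and a hypothetical data.parquet file; once either engine is installed, pandas can use it directly:

    # Install one engine first (run in a shell):
    #   pip install pyarrow          # or: pip install fastparquet
    #   conda install -c conda-forge pyarrow fastparquet

    import pandas as pd

    # pandas picks whichever engine is available ("auto"), or you can force one
    df = pd.read_parquet("data.parquet", engine="pyarrow")
    print(df.head())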
3 votes • 0 answers

Pandas Read/Write Parquet Data using Column Index

Is it possible to use pandas to selectively read rows from Parquet files using its column index? Similarly, when writing a Pandas DataFrame to a Parquet file, such as using pd.DataFrame.to_parquet(), is it possible to specify the DataFrame column or…
Athena Wisdom • 6,101 • 9 • 36 • 60
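
A minimal sketch of what the question above touches on, assuming the pyarrow engine and a hypothetical example.parquet file; column selection is well supported, while row selection by index value relies on engine-level filter pushdown:

    import pandas as pd

    df = pd.DataFrame({"a": range(5), "b": list("vwxyz")},
                      index=pd.Index(range(5), name="idx"))

    # index=True stores the named index as a column in the file
    df.to_parquet("example.parquet", index=True)

    # Read back only selected columns; the stored index is restored automatically
    subset = pd.read_parquet("example.parquet", columns=["a"])

    # Row selection by index value can be pushed down as a filter
    # (forwarded to the engine; behaviour is engine-dependent)
    rows = pd.read_parquet("example.parquet", engine="pyarrow",
                           filters=[("idx", ">=", 3)])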
3 votes • 2 answers

Why is computing the shape on an indexed Parquet file so slow in dask?

I have created a Parquet file from multiple Parquet files located in the same folder. Each file corresponds to a partition. Parquet files are created in different processes (using Python concurrent.futures). Here is an example of the code I run in…
hadim • 636 • 1 • 7 • 16
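
For context, a small sketch of two common ways to get the row count of a Dask dataframe, assuming a hypothetical my_parquet_dir/ dataset; shape[0] is lazy and forces a full pass when computed, so counting partition lengths is often the cheaper route:

    import dask.dataframe as dd

    ddf = dd.read_parquet("my_parquet_dir/")

    # Lazy scalar; .compute() reads every partition to count rows
    nrows = ddf.shape[0].compute()

    # Usually lighter: one len() task per partition, then sum the counts
    nrows = ddf.map_partitions(len).compute().sum()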
3 votes • 1 answer

Dask - How to cancel and resubmit stalled tasks?

Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…
dan • 183 • 13
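
One generic pattern (not the asker's exact setup) for cancelling and resubmitting work with dask.distributed, assuming hypothetical load_part and paths placeholders; pure=False gives the resubmitted task a fresh key so it is not confused with the cancelled one:

    import time
    import pandas as pd
    from dask.distributed import Client

    client = Client()                  # in practice: Client("tcp://scheduler:8786")

    def load_part(path):               # placeholder for the real work
        return pd.read_parquet(path)

    paths = ["part-0.parquet", "part-1.parquet"]      # hypothetical inputs
    futures = {client.submit(load_part, p): p for p in paths}

    time.sleep(600)                    # or any "this has stalled" heuristic

    # Cancel anything still running and submit a fresh task for the same input
    for fut, path in list(futures.items()):
        if not fut.done():
            client.cancel(fut)
            futures[client.submit(load_part, path, pure=False)] = path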
3 votes • 1 answer

Pandas and FastParquet read a single partition

I have a miserably long-running job to read in a dataset that has a natural, logical partition on US State. I have saved it as a partitioned parquet dataset from pandas using fastparquet (using pd.write_parquet). I want my buddy to be able to read…
user3502355 • 147 • 1 • 2 • 14
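
A minimal sketch of two ways to read just one partition, assuming a hive-style layout such as dataset/STATE=CA/… and a _metadata file as fastparquet writes by default; the paths and the STATE column are illustrative:

    import pandas as pd
    from fastparquet import ParquetFile

    # Option 1: point pandas at the single partition directory
    ca = pd.read_parquet("dataset/STATE=CA/")

    # Option 2: open the whole dataset but filter on the partition column,
    # so only matching files/row groups are read
    pf = ParquetFile("dataset/")
    ca = pf.to_pandas(filters=[("STATE", "==", "CA")])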
3 votes • 1 answer

Segmentation Fault while reading parquet file from AWS S3 using read_parquet in Python Pandas

I have a Python script running on an AWS EC2 instance (on AWS Linux), and the script pulls a parquet file from S3 into a Pandas dataframe. I'm now migrating to a new AWS account and setting up a new EC2 instance. This time, when executing the same script on python…
Niv Cohen • 1,078 • 2 • 11 • 21
3 votes • 1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan • 183 • 13
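
A small sketch of the usual idiom, assuming hypothetical dataset paths; map_partitions(len) creates one lightweight task per partition, and a single dask.compute call lets the scheduler handle several dataframes in one pass:

    import dask
    import dask.dataframe as dd

    ddfs = [dd.read_parquet(p) for p in ["ds_a/", "ds_b/"]]

    # One Series of per-partition row counts for each dataframe
    per_partition_lengths = dask.compute(*[d.map_partitions(len) for d in ddfs])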
3 votes • 1 answer

How to read a single parquet file from s3 into a dask dataframe?

I'm trying to read a single parquet file with snappy compression from s3 into a Dask Dataframe. There is no metadata directory, since this file was written using Spark 2.1. It does not work locally with fastparquet: import dask.dataframe as…
arinarmo • 375 • 1 • 11
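
A minimal sketch, assuming a hypothetical bucket/key and that s3fs is installed for the s3:// protocol; a single file without a _metadata directory can be passed straight to read_parquet:

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "s3://my-bucket/path/file.snappy.parquet",   # hypothetical key
        engine="pyarrow",                            # or "fastparquet"
        storage_options={"anon": False},             # passed to s3fs (credentials etc.)
    )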
3 votes • 1 answer

Generating parquet files - differences between R and Python

We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: the Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R…
skibee • 1,279 • 1 • 17 • 37
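
For reference, a hedged sketch of how the Dask side controls those summary files, assuming a reasonably recent dask and a hypothetical out_dir/ path; the _metadata/_common_metadata files are optional and can be suppressed at write time:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

    # Skip the _metadata/_common_metadata summary files so the layout
    # resembles output from writers that do not produce them
    ddf.to_parquet("out_dir/", engine="fastparquet", write_metadata_file=False)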
2 votes • 1 answer

Querying last row of sorted column where value is less than specific amount from parquet file

I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.

        X   Y
    0   1   Red
    1   5   Blue
    2   8   Green
    3  12   Purple
    4  15   Blue
    5  17   Purple

I am interested in querying the last value…
jd0 • 23 • 3
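
A minimal sketch of one approach, assuming a hypothetical big_file.parquet with X sorted ascending; reading the single sorted column is cheap because Parquet is columnar, and searchsorted then finds the last qualifying position:

    import pandas as pd

    # Read only the sorted column, then locate the last row where X < 11
    x = pd.read_parquet("big_file.parquet", columns=["X"])["X"]
    pos = x.searchsorted(11, side="left") - 1   # index of last value strictly below 11

    # Alternative: predicate pushdown, then take the final matching row
    # (may still materialise many rows if a lot of them satisfy the filter)
    candidates = pd.read_parquet("big_file.parquet", engine="pyarrow",
                                 filters=[("X", "<", 11)])
    last = candidates.iloc[-1]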
2 votes • 1 answer

AttributeError: 'ParquetFile' object has no attribute 'row_groups'

Pythonistas! Not sure what I am doing wrong while reading a parquet file here. I have all the necessary packages installed - pandas, fastparquet & pyarrow. The code literally just reads the parquet file: import pandas as pd FILE =…
ppatel26 • 183 • 13
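
One debugging sketch, assuming (not confirmed by the excerpt) that the error comes from an engine or version mismatch; checking which versions pandas actually picked up and forcing a specific engine usually narrows it down:

    import pandas as pd
    import fastparquet
    import pyarrow

    print(pd.__version__, fastparquet.__version__, pyarrow.__version__)

    # Forcing an explicit engine makes the failure reproducible or avoidable
    df = pd.read_parquet("data.parquet", engine="pyarrow")   # hypothetical file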
2 votes • 0 answers

Neither pyarrow nor fastparquet able to read vector output from spark correctly

I have a dataframe written from Spark in parquet format which has a column of type 'vector' in it. Printing the schema in Spark gives the following: DataFrame[key: string, embedding: vector]. I have tried the following two approaches in python…
2 votes • 0 answers

Cannot use index on Dask dataframe stored in parquet file with Append=True

I have a use case where I want to store multiple Dask dataframes into a common parquet storage through to_parquet(ddf, 'TestParquet', append=True). The structure of the parquet files is set through the first dataframe being written to it (without…
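
A small sketch of the append pattern being described, assuming the fastparquet engine and tiny in-memory frames; the first write fixes the schema, and ignore_divisions is needed when the appended partitions' index ranges overlap the existing ones:

    import pandas as pd
    import dask.dataframe as dd

    ddf1 = dd.from_pandas(pd.DataFrame({"k": [1, 2], "v": ["a", "b"]}), npartitions=1)
    ddf2 = dd.from_pandas(pd.DataFrame({"k": [3, 4], "v": ["c", "d"]}), npartitions=1)

    dd.to_parquet(ddf1, "TestParquet", engine="fastparquet")
    dd.to_parquet(ddf2, "TestParquet", engine="fastparquet",
                  append=True, ignore_divisions=True)

    # A usable index can be set after reading back (this triggers a shuffle)
    ddf = dd.read_parquet("TestParquet").set_index("k")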
2 votes • 1 answer

Reading large number of parquet files: read_parquet vs from_delayed

I'm reading a large number (100s to 1000s) of parquet files into a single dask dataframe (single machine, all local). I realized that files = ['file1.parq', 'file2.parq', ...] ddf = dd.read_parquet(files,…
mcsoini • 6,280 • 2 • 15 • 38
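
A side-by-side sketch of the two approaches named in the question, assuming hypothetical local file names; read_parquet builds one graph and can use parquet metadata, while from_delayed just wraps one pandas read per file:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    files = ["file1.parq", "file2.parq"]

    # Single call: dask handles metadata, statistics and partitioning itself
    ddf_a = dd.read_parquet(files)

    # from_delayed: one delayed pandas read per file; a simpler graph,
    # but without parquet-specific metadata handling
    parts = [dask.delayed(pd.read_parquet)(f) for f in files]
    ddf_b = dd.from_delayed(parts)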
2 votes • 0 answers

Unable to import fastparquet in jupyter notebooks

Similar question to this one, but my error reports ModuleNotFoundError: No module named 'fastparquet'. When I run conda list under the same virtual environment, I get the listing shown (screenshot omitted). However, I'm able to import fastparquet when I'm in the interactive dialogue in…
Max Wong • 694 • 10 • 18
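
A short diagnostic sketch for this kind of mismatch, assuming the usual cause is that the notebook kernel runs a different interpreter than the activated conda environment; the env name myenv is illustrative:

    import sys

    print(sys.executable)   # which Python the notebook kernel is actually running
    print(sys.path)         # where that interpreter looks for packages

    # If the kernel points outside the conda env, register the env (run in a shell):
    #   conda activate myenv
    #   python -m ipykernel install --user --name myenv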