Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
3 votes • 0 answers

Failed to install pyarrow and fastparquet

Good afternoon everyone, first of all I am new to Python, so please bear with me. I am trying to read and manipulate a .parquet file, so I looked up on the internet what I should do and found that I should use pyarrow or fastparquet. So I tried…
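
A minimal sketch of the usual route, assuming a standard pip or conda environment and a hypothetical data.parquet file; once either engine is installed, pandas can use it directly:

    # Install one engine first (run in a shell):
    #   pip install pyarrow          # or: pip install fastparquet
    #   conda install -c conda-forge pyarrow fastparquet

    import pandas as pd

    # pandas picks whichever engine is available ("auto"), or you can force one
    df = pd.read_parquet("data.parquet", engine="pyarrow")
    print(df.head())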
3 votes • 0 answers

Pandas Read/Write Parquet Data using Column Index

Is it possible to use pandas to selectively read rows from Parquet files using its column index? Similarly, when writing a Pandas DataFrame to a Parquet file, such as using pd.DataFrame.to_parquet(), is it possible to specify the DataFrame column or…
Athena Wisdom • 6,101 • 9 • 36 • 60
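
A minimal sketch of what the question above touches on, assuming the pyarrow engine and a hypothetical example.parquet file; column selection is well supported, while row selection by index value relies on engine-level filter pushdown:

    import pandas as pd

    df = pd.DataFrame({"a": range(5), "b": list("vwxyz")},
                      index=pd.Index(range(5), name="idx"))

    # index=True stores the named index as a column in the file
    df.to_parquet("example.parquet", index=True)

    # Read back only selected columns; the stored index is restored automatically
    subset = pd.read_parquet("example.parquet", columns=["a"])

    # Row selection by index value can be pushed down as a filter
    # (forwarded to the engine; behaviour is engine-dependent)
    rows = pd.read_parquet("example.parquet", engine="pyarrow",
                           filters=[("idx", ">=", 3)])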
3 votes • 2 answers

Why is computing the shape on an indexed Parquet file so slow in dask?

I have created a Parquet file from multiple Parquet files located in the same folder. Each file corresponds to a partition. Parquet files are created in different processes (using Python concurrent.futures). Here is an example of the code I run in…
hadim • 636 • 1 • 7 • 16
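
For context, a small sketch of two common ways to get the row count of a Dask dataframe, assuming a hypothetical my_parquet_dir/ dataset; shape[0] is lazy and forces a full pass when computed, so counting partition lengths is often the cheaper route:

    import dask.dataframe as dd

    ddf = dd.read_parquet("my_parquet_dir/")

    # Lazy scalar; .compute() reads every partition to count rows
    nrows = ddf.shape[0].compute()

    # Usually lighter: one len() task per partition, then sum the counts
    nrows = ddf.map_partitions(len).compute().sum()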
3 votes • 1 answer

Dask - How to cancel and resubmit stalled tasks?

Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…
dan • 183 • 13
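
One generic pattern (not the asker's exact setup) for cancelling and resubmitting work with dask.distributed, assuming hypothetical load_part and paths placeholders; pure=False gives the resubmitted task a fresh key so it is not confused with the cancelled one:

    import time
    import pandas as pd
    from dask.distributed import Client

    client = Client()                  # in practice: Client("tcp://scheduler:8786")

    def load_part(path):               # placeholder for the real work
        return pd.read_parquet(path)

    paths = ["part-0.parquet", "part-1.parquet"]      # hypothetical inputs
    futures = {client.submit(load_part, p): p for p in paths}

    time.sleep(600)                    # or any "this has stalled" heuristic

    # Cancel anything still running and submit a fresh task for the same input
    for fut, path in list(futures.items()):
        if not fut.done():
            client.cancel(fut)
            futures[client.submit(load_part, path, pure=False)] = path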
3 votes • 1 answer

Pandas and FastParquet read a single partition

I have a miserably long-running job to read in a dataset that has a natural, logical partition on US State. I have saved it as a partitioned parquet dataset from pandas using fastparquet (using pd.write_parquet). I want my buddy to be able to read…
user3502355 • 147 • 1 • 2 • 14
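
A minimal sketch of two ways to read just one partition, assuming a hive-style layout such as dataset/STATE=CA/… and a _metadata file as fastparquet writes by default; the paths and the STATE column are illustrative:

    import pandas as pd
    from fastparquet import ParquetFile

    # Option 1: point pandas at the single partition directory
    ca = pd.read_parquet("dataset/STATE=CA/")

    # Option 2: open the whole dataset but filter on the partition column,
    # so only matching files/row groups are read
    pf = ParquetFile("dataset/")
    ca = pf.to_pandas(filters=[("STATE", "==", "CA")])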
3 votes • 1 answer

Segmentation Fault while reading parquet file from AWS S3 using read_parquet in Python Pandas

I have a Python script running on an AWS EC2 instance (on AWS Linux), and the script pulls a parquet file from S3 into a Pandas dataframe. I'm now migrating to a new AWS account and setting up a new EC2 instance. This time, when executing the same script on python…
Niv Cohen • 1,078 • 2 • 11 • 21
3 votes • 1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan • 183 • 13
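
A small sketch of the usual idiom, assuming hypothetical dataset paths; map_partitions(len) creates one lightweight task per partition, and a single dask.compute call lets the scheduler handle several dataframes in one pass:

    import dask
    import dask.dataframe as dd

    ddfs = [dd.read_parquet(p) for p in ["ds_a/", "ds_b/"]]

    # One Series of per-partition row counts for each dataframe
    per_partition_lengths = dask.compute(*[d.map_partitions(len) for d in ddfs])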
3 votes • 1 answer

How to read a single parquet file from s3 into a dask dataframe?

I'm trying to read a single parquet file with snappy compression from s3 into a Dask Dataframe. There is no metadata directory, since this file was written using Spark 2.1. It does not work locally with fastparquet: import dask.dataframe as…
arinarmo • 375 • 1 • 11
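
A minimal sketch, assuming a hypothetical bucket/key and that s3fs is installed for the s3:// protocol; a single file without a _metadata directory can be passed straight to read_parquet:

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "s3://my-bucket/path/file.snappy.parquet",   # hypothetical key
        engine="pyarrow",                            # or "fastparquet"
        storage_options={"anon": False},             # passed to s3fs (credentials etc.)
    )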
3 votes • 1 answer

Generating parquet files - differences between R and Python

We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: the Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R…
skibee • 1,279 • 1 • 17 • 37
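
For reference, a hedged sketch of how the Dask side controls those summary files, assuming a reasonably recent dask and a hypothetical out_dir/ path; the _metadata/_common_metadata files are optional and can be suppressed at write time:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

    # Skip the _metadata/_common_metadata summary files so the layout
    # resembles output from writers that do not produce them
    ddf.to_parquet("out_dir/", engine="fastparquet", write_metadata_file=False)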
2 votes • 1 answer

Querying last row of sorted column where value is less than specific amount from parquet file

I have a large parquet file where the data in one of the columns is sorted. A very simplified example is below.

        X   Y
    0   1   Red
    1   5   Blue
    2   8   Green
    3  12   Purple
    4  15   Blue
    5  17   Purple

I am interested in querying the last value…
jd0 • 23 • 3
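
A minimal sketch of one approach, assuming a hypothetical big_file.parquet with X sorted ascending; reading the single sorted column is cheap because Parquet is columnar, and searchsorted then finds the last qualifying position:

    import pandas as pd

    # Read only the sorted column, then locate the last row where X < 11
    x = pd.read_parquet("big_file.parquet", columns=["X"])["X"]
    pos = x.searchsorted(11, side="left") - 1   # index of last value strictly below 11

    # Alternative: predicate pushdown, then take the final matching row
    # (may still materialise many rows if a lot of them satisfy the filter)
    candidates = pd.read_parquet("big_file.parquet", engine="pyarrow",
                                 filters=[("X", "<", 11)])
    last = candidates.iloc[-1]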
2 votes • 1 answer

AttributeError: 'ParquetFile' object has no attribute 'row_groups'

Pythonistas! Not sure what I am doing wrong while reading a parquet file here. I have all the necessary packages installed - pandas, fastparquet & pyarrow. The code literally just reads the parquet file: import pandas as pd FILE =…
ppatel26 • 183 • 13
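
One debugging sketch, assuming (not confirmed by the excerpt) that the error comes from an engine or version mismatch; checking which versions pandas actually picked up and forcing a specific engine usually narrows it down:

    import pandas as pd
    import fastparquet
    import pyarrow

    print(pd.__version__, fastparquet.__version__, pyarrow.__version__)

    # Forcing an explicit engine makes the failure reproducible or avoidable
    df = pd.read_parquet("data.parquet", engine="pyarrow")   # hypothetical file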
2 votes • 0 answers

Neither pyarrow nor fastparquet able to read vector output from spark correctly

I have a dataframe written from Spark in parquet format which has a column of type 'vector' in it. Printing the schema in Spark gives the following: DataFrame[key: string, embedding: vector]. I have tried the following two approaches in python…
2 votes • 0 answers

Cannot use index on Dask dataframe stored in parquet file with Append=True

I have a use case where I want to store multiple Dask dataframes into a common parquet storage through to_parquet(ddf, 'TestParquet', append=True). The structure of the parquet files is set through the first dataframe being written to it (without…
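
A small sketch of the append pattern being described, assuming the fastparquet engine and tiny in-memory frames; the first write fixes the schema, and ignore_divisions is needed when the appended partitions' index ranges overlap the existing ones:

    import pandas as pd
    import dask.dataframe as dd

    ddf1 = dd.from_pandas(pd.DataFrame({"k": [1, 2], "v": ["a", "b"]}), npartitions=1)
    ddf2 = dd.from_pandas(pd.DataFrame({"k": [3, 4], "v": ["c", "d"]}), npartitions=1)

    dd.to_parquet(ddf1, "TestParquet", engine="fastparquet")
    dd.to_parquet(ddf2, "TestParquet", engine="fastparquet",
                  append=True, ignore_divisions=True)

    # A usable index can be set after reading back (this triggers a shuffle)
    ddf = dd.read_parquet("TestParquet").set_index("k")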
2 votes • 1 answer

Reading large number of parquet files: read_parquet vs from_delayed

I'm reading a large number (100s to 1000s) of parquet files into a single dask dataframe (single machine, all local). I realized that files = ['file1.parq', 'file2.parq', ...] ddf = dd.read_parquet(files,…
mcsoini • 6,280 • 2 • 15 • 38
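
A side-by-side sketch of the two approaches named in the question, assuming hypothetical local file names; read_parquet builds one graph and can use parquet metadata, while from_delayed just wraps one pandas read per file:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    files = ["file1.parq", "file2.parq"]

    # Single call: dask handles metadata, statistics and partitioning itself
    ddf_a = dd.read_parquet(files)

    # from_delayed: one delayed pandas read per file; a simpler graph,
    # but without parquet-specific metadata handling
    parts = [dask.delayed(pd.read_parquet)(f) for f in files]
    ddf_b = dd.from_delayed(parts)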
2 votes • 0 answers

Unable to import fastparquet in jupyter notebooks

Similar question to this one, but my error reports ModuleNotFoundError: No module named 'fastparquet'. When I run conda list under the same virtual environment, I get the listing shown (screenshot omitted). However, I'm able to import fastparquet when I'm in the interactive dialogue in…
Max Wong • 694 • 10 • 18
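
A short diagnostic sketch for this kind of mismatch, assuming the usual cause is that the notebook kernel runs a different interpreter than the activated conda environment; the env name myenv is illustrative:

    import sys

    print(sys.executable)   # which Python the notebook kernel is actually running
    print(sys.path)         # where that interpreter looks for packages

    # If the kernel points outside the conda env, register the env (run in a shell):
    #   conda activate myenv
    #   python -m ipykernel install --user --name myenv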