A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
1
vote
0 answers
While creating a Parquet file using Dask (fastparquet) with the append option, the first partition file is missing from the folder
When we create a Parquet file with the append option, the first partition file of the Parquet dataset is missing from the final result. Does anyone know the reason? We are using Dask 2.30, and this happens only in one environment but in another completely different …

Arun
- 41
- 1
- 4
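A minimal sketch of the append pattern being described, assuming a local output path and fastparquet as the engine; the data and paths are placeholders:

import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
df.to_parquet("out/", engine="fastparquet")                # initial write
df.to_parquet("out/", engine="fastparquet", append=True)   # later append; earlier part files should remain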
1
vote
1 answer
fastparquet export for Redshift
I had a very simple idea: Use Python Pandas (for convenience) to do some simple database operations with moderate data amounts and write the data back to S3 in Parquet format.
Then, the data should be exposed to Redshift as an external table in…

Werner
- 95
- 11
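A sketch of the write step, assuming s3fs is installed so pandas can write to an s3:// URL; the bucket path and data are placeholders, and the CREATE EXTERNAL TABLE step on the Redshift side is separate:

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})   # stand-in data
df.to_parquet("s3://my-bucket/exports/data.parquet",
              engine="fastparquet", compression="snappy", index=False)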
1
vote
2 answers
How can one append to parquet files and how does it affect partitioning?
Does Parquet allow appending to a Parquet file periodically?
How does appending relate to partitioning, if at all? For example, if I were able to identify a low-cardinality column and partition by it, and I were to append more data…

Abhishek Malik
- 305
- 4
- 14
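A sketch of how fastparquet combines the two, assuming a hive-style dataset and a placeholder low-cardinality column:

import fastparquet as fp
import pandas as pd

df = pd.DataFrame({"category": ["a", "b"], "value": [1, 2]})   # stand-in data
# the first write creates one directory per partition value
fp.write("dataset", df, file_scheme="hive", partition_on=["category"])
# an append adds new row groups under the matching partition directories
fp.write("dataset", df, file_scheme="hive", partition_on=["category"], append=True)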
1
vote
1 answer
Split parquet from s3 into chunks
I'm using the following code to read parquet files from s3. Next, I want to iterate over it in chunks. How can I achieve this?
import s3fs
import fastparquet as fp
fs = s3fs.S3FileSystem()
bucket, path = 'mybucket',…

ProgramSpree
- 372
- 5
- 21
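One way to get chunked iteration with fastparquet is per row group; a sketch assuming a placeholder S3 key:

import s3fs
import fastparquet as fp

fs = s3fs.S3FileSystem()
pf = fp.ParquetFile("mybucket/path/file.parquet", open_with=fs.open)   # placeholder key
for chunk in pf.iter_row_groups():   # one pandas DataFrame per row group
    print(len(chunk))                # stand-in for per-chunk processing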
1
vote
1 answer
Read parquet file using pd.read_parquet looking for a schema
I'm working on an app that writes Parquet files.
For testing purposes, I'm trying to read a generated file with pd.read_parquet.
I get a really strange error that asks for a schema:
self = <[AttributeError("'ParquetFile' object has no attribute…

Alex
- 389
- 4
- 21
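When debugging an error like this, it can help to inspect what fastparquet itself sees in the file; a sketch with a placeholder filename:

import fastparquet as fp

pf = fp.ParquetFile("generated.parquet")   # placeholder for the generated file
print(pf.schema)    # the Parquet schema fastparquet parsed from the footer
print(pf.dtypes)    # the pandas dtypes it would produce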
1
vote
1 answer
How to read a 30 GB Parquet file in Python
I am trying to read data from a large Parquet file of 30 GB. My machine's memory cannot support the default read with fastparquet in Python, so I do not know what I should do to lower the memory usage of the reading process.

Kehan Chen
- 11
- 1
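A common way to bound memory with fastparquet is to read only the needed columns, one row group at a time; a sketch with placeholder file and column names:

import fastparquet as fp

pf = fp.ParquetFile("big.parquet")   # placeholder for the 30 GB file
for chunk in pf.iter_row_groups(columns=["col_a", "col_b"]):
    print(len(chunk))   # stand-in for incremental per-chunk processing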
1
vote
1 answer
Reading index based range from Parquet File using Python
I'm trying to read a range of data (say rows 1000 to 5000) from a parquet file. I've tried pandas with the fastparquet engine and even pyarrow, but can't seem to find any option to do so.
Is there any way to achieve this?

MetalMonkey
- 17
- 1
- 9
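fastparquet has no direct row-range read, but the row-group sizes in the footer let one keep only the overlapping groups and trim afterwards; a sketch assuming a placeholder file (skipped groups are still decoded, but never all held in memory at once):

import fastparquet as fp
import pandas as pd

pf = fp.ParquetFile("data.parquet")   # placeholder file
start, stop = 1000, 5000

frames, seen, first = [], 0, None
for rg, chunk in zip(pf.row_groups, pf.iter_row_groups()):
    lo, hi = seen, seen + rg.num_rows
    if hi > start and lo < stop:      # this row group overlaps the range
        if first is None:
            first = lo
        frames.append(chunk)
    seen = hi
    if seen >= stop:
        break

# trim the concatenated row groups down to the exact row range
df = pd.concat(frames, ignore_index=True).iloc[start - first:stop - first]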
1
vote
1 answer
Is it possible to read a Parquet dataset partitioned by hand using Dask with the Fastparquet reader?
I created a Parquet dataset partitioned as follows:
2019-taxi-trips/
    month=1/
        data.parquet
    month=2/
        data.parquet
    ...
    month=12/
        data.parquet
This organization follows the Parquet dataset…

Aleksey Bilogur
- 3,686
- 3
- 30
- 57
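A sketch of the read, assuming the directory layout above; the hive-style month=... directories are reconstructed as a column:

import dask.dataframe as dd

df = dd.read_parquet("2019-taxi-trips/", engine="fastparquet")
print(df.columns)   # should include "month" derived from the directory names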
1
vote
1 answer
How to read nested struct Parquet files in Python?
I have a Parquet file which contains a list of structs, and I cannot seem to read it with any of the available Python Parquet libraries. Some of them return an error noting that 'list of structs' is not yet supported, and the others just make a pandas…

Nilan Saha
- 191
- 1
- 9
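At the time, pyarrow handled more nested types than fastparquet, so reading through Arrow is one workaround; a sketch with a placeholder filename:

import pyarrow.parquet as pq

table = pq.read_table("nested.parquet")   # placeholder file with a list-of-structs column
df = table.to_pandas()                    # struct values arrive as Python dicts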
1
vote
1 answer
Reading partitioned Parquet files to DataFrame in Python (in memory) where a column type is array of array
Context
I have partitioned Parquet files in S3. I want to read and concatenate them into a DataFrame so I can query and view the data (in memory). I have gotten this far; however, one column's data, with the type (array<array<…>>), is converted…

Mahshid Zeinaly
- 3,590
- 6
- 25
- 32
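A sketch of opening a multi-file S3 dataset as one logical ParquetFile, with a placeholder bucket and layout:

import s3fs
import fastparquet as fp

fs = s3fs.S3FileSystem()
files = fs.glob("mybucket/dataset/*/*.parquet")   # placeholder layout
pf = fp.ParquetFile(files, open_with=fs.open)     # a list of files behaves as one dataset
df = pf.to_pandas()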
1
vote
0 answers
'S3File' object has no attribute 'forced'
Trying to append to a Parquet file in S3 using the fastparquet library, I am getting the error below:
File "/Users/baluinfo/PycharmProjects/untitled/rough.py", line 55, in
write(parqKey, ws1, write_index=False, append=True, compression='GZIP', open_with=myopen)
…

Bala
- 51
- 1
- 2
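For reference, the usual shape of such an append through s3fs looks like the sketch below; the names mirror the question and are placeholders, and append=True requires the target file to already exist and be readable through the same filesystem:

import pandas as pd
import s3fs
from fastparquet import write

fs = s3fs.S3FileSystem()
ws1 = pd.DataFrame({"x": [1]})       # stand-in for the question's DataFrame
parqKey = "mybucket/key.parquet"     # placeholder key
write(parqKey, ws1, write_index=False, append=True,
      compression="GZIP", open_with=fs.open)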
1
vote
0 answers
Cannot install fastparquet in jupyter notebook
I am trying to install fastparquet in order to write a CSV into a Parquet file. Using a Jupyter notebook with Python 3,
the cell does not show any result after running the following command:
pip install fastparquet
I run another simple command and it…

jusmin
- 7
- 2
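One thing worth checking: a bare pip install in a notebook cell may target a different environment than the running kernel. A sketch, assuming IPython's %pip magic is available (IPython 7.3+):

# in a notebook cell: %pip installs into the same environment the kernel runs in
%pip install fastparquet

# then verify the import in a fresh cell
import fastparquet
print(fastparquet.__version__)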
1
vote
1 answer
Moving data from a database to Azure blob storage
I'm able to use dask.dataframe.read_sql_table to read the data e.g. df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
What would be the next (best) steps to saving it as a parquet file in Azure blob storage?
From my…

Ray Bell
- 1,508
- 4
- 18
- 45
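A sketch of the follow-on step, assuming the adlfs package is installed so dask can write through the abfs:// protocol; the connection string, account name, and key are placeholders:

import dask.dataframe as dd

uri = "sqlite:///my.db"   # placeholder connection string
df = dd.read_sql_table(table="TABLE", uri=uri, index_col="field", npartitions=4)
df.to_parquet("abfs://container/path/",
              engine="fastparquet",
              storage_options={"account_name": "ACCOUNT", "account_key": "KEY"})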
1
vote
0 answers
dask computation got different errors with pyarrow and s3
I was doing some parallel groupby computation with dask, using pyarrow to load Parquet files from s3. However, the same piece of code may run or fail (with different error messages) seemingly at random. The same issue happened when using…

zhh210
- 388
- 4
- 12
1
vote
1 answer
compression option in fastparquet is not consistent
According to the fastparquet project page, fastparquet supports various compression methods:
Optional (compression algorithms; gzip is always available):
snappy (aka python-snappy)
lzo
brotli
lz4
zstandard
In particular, zstandard is modern…

user15964
- 2,507
- 2
- 31
- 57
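A sketch of selecting one of the optional codecs per write; "ZSTD" requires the zstandard package to be importable, and a per-column mapping is also accepted:

import fastparquet as fp
import pandas as pd

df = pd.DataFrame({"x": range(5)})                 # stand-in data
fp.write("data.parquet", df, compression="ZSTD")   # raises if zstandard is missing
# per-column choices also work, e.g. compression={"x": "GZIP"}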