Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
6
votes
1 answer

Write nested parquet format from Python

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any…
Stephan Claus
  • 405
  • 1
  • 6
  • 16
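A minimal sketch of one way to do this, assuming pyarrow is available and the JSON schema is known: parse the varchar column with the standard json module and let pyarrow turn the resulting dicts into a struct-typed column before writing. The column names (raw_json, payload) and the sample data are placeholders.

    import json
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Flat frame where one varchar column holds JSON strings (placeholder data).
    df = pd.DataFrame({
        "id": [1, 2],
        "raw_json": ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}'],
    })

    # Parse each JSON string into a Python dict; pyarrow infers a struct type.
    parsed = df["raw_json"].map(json.loads)
    table = pa.table({
        "id": df["id"],
        "payload": pa.array(parsed.tolist()),  # struct<a: int64, b: string>
    })

    pq.write_table(table, "nested.parquet")  # nested column in the output file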
6
votes
3 answers

Why does the index name always appear in the parquet file created with pandas?

I am trying to create a parquet file from a pandas dataframe, and even though I delete the index, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set to None. >>> df =…
Jyoti Dhiman
  • 540
  • 2
  • 6
  • 17
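A short sketch of the usual workaround, assuming either pyarrow or fastparquet is installed as the engine: skip the index entirely with index=False, or reset it and clear its name before writing.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    # Option 1: do not write the index at all.
    df.to_parquet("no_index.parquet", index=False)

    # Option 2: keep a plain, unnamed RangeIndex and write normally.
    df = df.reset_index(drop=True)
    df.index.name = None
    df.to_parquet("plain_index.parquet")

    print(pd.read_parquet("no_index.parquet").index.name)  # None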
6
votes
3 answers

How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet

I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories. For example: import fastparquet pfile =…
Tim Morton
  • 240
  • 1
  • 3
  • 11
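One commonly suggested pattern, sketched here with hypothetical directory names: pass dask.dataframe.read_parquet a list of paths or glob patterns instead of a single directory, so all pieces land in one dataframe.

    import dask.dataframe as dd

    # Hypothetical directories, each holding parquet files with the same schema.
    paths = [
        "data/2021/part.*.parquet",
        "data/2022/part.*.parquet",
    ]

    # read_parquet accepts a list of paths/globs and concatenates the partitions.
    ddf = dd.read_parquet(paths, engine="fastparquet")
    print(ddf.npartitions)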
5
votes
2 answers

Reading Parquet File with Array<Map<String, String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
  • 794
  • 2
  • 9
  • 23
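A rough sketch of the workaround often used for list-of-map/struct columns, assuming pyarrow is installed: read with the pyarrow engine, which round-trips nested types that fastparquet may not handle. The path and the 'events' column name are placeholders.

    import dask.dataframe as dd

    # Hypothetical path to the PySpark-written dataset; 'events' is the
    # array<map<string,string>> (list of dicts) column.
    ddf = dd.read_parquet("spark_output/", engine="pyarrow")

    # Each cell of 'events' comes back as a list of dicts in pandas.
    print(ddf["events"].head())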
5
votes
1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G, 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
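A minimal sketch of the general shape of that pipeline, with placeholder paths and sizes; the memory-relevant knobs are the blocksize used when reading the JSON-lines dump and the shuffle triggered by set_index.

    import dask.dataframe as dd

    # Read the JSON-lines dump in modest blocks so no single partition is huge.
    ddf = dd.read_json("cirrussearch-dump.json", lines=True, blocksize=2**28)

    # Sorting by 'title' requires a shuffle; doing it once here means the
    # resulting parquet dataset is already indexed for later reads.
    ddf = ddf.set_index("title")

    ddf.to_parquet("wiki_parquet/", engine="fastparquet")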
5
votes
1 answer

How to write a Dask dataframe containing a column of arrays to a parquet file

I have a Dask dataframe, one column of which contains a numpy array of floats: import dask.dataframe as dd import pandas as pd import numpy as np df = dd.from_pandas( pd.DataFrame( { 'id':range(1, 6), …
junichiro
  • 5,282
  • 3
  • 18
  • 26
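One approach that is often suggested, sketched with hypothetical column names and assuming a dask version whose pyarrow writer accepts an explicit schema: convert the numpy arrays to plain lists and declare the column as a parquet list type.

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    import pyarrow as pa

    pdf = pd.DataFrame({
        "id": range(1, 6),
        "vec": [np.random.rand(3) for _ in range(5)],  # column of float arrays
    })
    # Arrow handles plain lists inside object cells more predictably than
    # numpy arrays, so convert before writing.
    pdf["vec"] = pdf["vec"].map(list)

    ddf = dd.from_pandas(pdf, npartitions=2)

    # Explicit schema: 'vec' is written as list<double> in the parquet file.
    schema = pa.schema([("id", pa.int64()), ("vec", pa.list_(pa.float64()))])
    ddf.to_parquet("arrays_parquet/", engine="pyarrow", schema=schema)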
4
votes
1 answer

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory. Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype…
Kermit
  • 4,922
  • 4
  • 42
  • 74
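A sketch of what partial reads look like with fastparquet, assuming the file was written with several row groups; only the requested columns and row groups are decoded rather than the whole file. The file and column names are placeholders.

    from fastparquet import ParquetFile

    pf = ParquetFile("big.parquet")  # reads only the footer/metadata

    # Pull a single column; other columns are never decompressed.
    prices = pf.to_pandas(columns=["price"])

    # Or stream one row group at a time to bound memory.
    for chunk in pf.iter_row_groups(columns=["price"]):
        print(chunk["price"].sum())  # placeholder per-chunk work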
4
votes
1 answer

Read/Write Parquet with Struct column type

I am trying to write a Dataframe like this to Parquet: | foo | bar | |-----|-------------------| | 1 | {"a": 1, "b": 10} | | 2 | {"a": 2, "b": 20} | | 3 | {"a": 3, "b": 30} | I am doing it with Pandas and Fastparquet: df =…
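A small sketch of the pyarrow-based route, since fastparquet has historically had limited struct support; the column names follow the question, and this assumes pyarrow is installed.

    import pandas as pd

    df = pd.DataFrame({
        "foo": [1, 2, 3],
        "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}],
    })

    # pyarrow maps the dict column to a parquet struct<a: int64, b: int64>.
    df.to_parquet("struct.parquet", engine="pyarrow")

    back = pd.read_parquet("struct.parquet", engine="pyarrow")
    print(back["bar"][0])  # {'a': 1, 'b': 10}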
4
votes
0 answers

Unable to read parquet file with fastparquet but works with pyarrow - nullable ints

Currently running some code like this: df = pd.read_parquet('/tmp/my-file.parquet', engine='pyarrow') I was having memory consumption issues since the files are large, so I wanted to investigate whether fastparquet would work better for memory…
JD D
  • 7,398
  • 2
  • 34
  • 53
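A hedged sketch of one possible workaround, assuming the failure really does come from pandas nullable Int64 columns written by pyarrow: cast them to float (so nulls become NaN) before producing the file that fastparquet has to read. Whether this matches the actual file is an assumption.

    import pandas as pd

    df = pd.DataFrame({"count": pd.array([1, None, 3], dtype="Int64")})

    # Nullable Int64 -> float64: missing values become NaN, which fastparquet
    # reads without needing nullable-int support.
    df["count"] = df["count"].astype("float64")
    df.to_parquet("/tmp/my-file.parquet", engine="pyarrow")

    back = pd.read_parquet("/tmp/my-file.parquet", engine="fastparquet")
    print(back.dtypes)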
4
votes
2 answers

Python Pandas to convert CSV to Parquet using Fastparquet

I am using the Python 3.6 interpreter in my PyCharm venv and trying to convert a CSV to Parquet. import pandas as pd df = pd.read_csv('/parquet/drivers.csv') df.to_parquet('output.parquet') Error-1 ImportError: Unable to find a usable engine;…
Himalay Majumdar
  • 3,883
  • 14
  • 65
  • 94
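The usual shape of the fix, sketched here: install one of the two engines into the same venv and name it explicitly so pandas does not have to guess which one to use.

    # pip install fastparquet   (run inside the same PyCharm venv)
    import pandas as pd

    df = pd.read_csv("/parquet/drivers.csv")
    df.to_parquet("output.parquet", engine="fastparquet")  # or engine="pyarrow"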
4
votes
3 answers

dask read_parquet with pyarrow memory blow up

I am using dask to write and read parquet. I am writing using the fastparquet engine and reading using the pyarrow engine. My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the…
pranav kohli
  • 123
  • 2
  • 6
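A sketch of the mitigations usually tried first, with placeholder paths: keep reading with the engine whose memory profile was acceptable, or (assuming a dask version that supports it) split large row groups into smaller partitions so no single task materialises an oversized chunk.

    import dask.dataframe as dd

    # Option 1: stay on the engine that fit in the 1 GB worker.
    ddf = dd.read_parquet("data/*.parquet", engine="fastparquet")

    # Option 2: with pyarrow, split each row group into its own partition to
    # keep per-task memory small (split_row_groups is an assumption about the
    # dask version in use).
    ddf = dd.read_parquet("data/*.parquet", engine="pyarrow", split_row_groups=True)

    print(ddf.npartitions)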
4
votes
0 answers

Optimal approach to create dask dataframe from parquet files (HDFS) in different directories

I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time. Approach 1: call the read_parquet API with a glob path.…
Santosh Kumar
  • 761
  • 5
  • 28
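A third, delayed-based sketch, assuming an fsspec/pyarrow HDFS driver is configured so pandas can read hdfs:// paths directly; each directory becomes one lazy read and nothing loads until the graph runs. The paths are placeholders.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    dirs = ["hdfs:///logs/app1/", "hdfs:///logs/app2/"]  # hypothetical directories

    # One lazy pandas read per directory, stitched into a single dask dataframe.
    parts = [dask.delayed(pd.read_parquet)(d) for d in dirs]
    ddf = dd.from_delayed(parts)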
3
votes
0 answers

Why does pandas.to_parquet need so much RAM?

Wisdom of such a use case aside... how come a machine with 512 GB of RAM (and nothing else running) runs out of memory while trying to save a pandas df (df.to_parquet(...)) that has an object size (sys.getsizeof) of "only" ~25 GB? The df is ~73M rows by 2…
Tim
  • 236
  • 2
  • 8
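A hedged sketch of one mitigation, assuming the frame can be written incrementally: write it in slices with fastparquet's append mode so the writer never holds conversion buffers for the whole ~73M-row frame at once. The chunk size is a placeholder.

    import pandas as pd
    import fastparquet

    def write_in_chunks(df: pd.DataFrame, path: str, chunk_rows: int = 5_000_000):
        """Write df to a single parquet file one slice (row group) at a time."""
        for start in range(0, len(df), chunk_rows):
            chunk = df.iloc[start:start + chunk_rows]
            # The first slice creates the file; later slices append row groups.
            fastparquet.write(path, chunk, append=start > 0)

    # write_in_chunks(df, "big.parquet")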
3
votes
0 answers

Unable to export dataframe due to OverflowError: Python int too large to convert to C long

I am trying to export a pandas dataframe as a parquet file. This dataframe has a memory usage of 4GB+ with 76 million rows and 6 columns (int64(3) columns, object(3) columns). When I write this out as a parquet file, I am getting an OverflowError:…
veg2020
  • 956
  • 10
  • 27
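A diagnostic sketch under the assumption that the error comes from an object column holding Python ints above the int64 range; finding the offending column and casting it (to string here) usually unblocks the write. The helper name is hypothetical.

    import pandas as pd

    INT64_MAX = 2**63 - 1

    def columns_with_huge_ints(df: pd.DataFrame):
        """Report object columns containing ints that do not fit in int64."""
        bad = []
        for col in df.select_dtypes(include="object"):
            if df[col].map(lambda v: isinstance(v, int) and abs(v) > INT64_MAX).any():
                bad.append(col)
        return bad

    # for col in columns_with_huge_ints(df):
    #     df[col] = df[col].astype(str)  # or rescale/split the values
    # df.to_parquet("out.parquet", engine="fastparquet")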
3
votes
1 answer

How to convert parquet to json

I have parquet files hosted on S3 that I want to download and convert to JSON. I was able to use select_object_content to output certain files as JSON using SQL in the past. I need to find a faster way to do it because it is timing out for larger…
bnykpp
  • 55
  • 1
  • 5
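A small local-conversion sketch, assuming s3fs is installed for direct S3 reads (or that the file has already been downloaded) and that each file fits in memory; the paths are placeholders.

    import pandas as pd

    # Requires s3fs for direct S3 reads; otherwise download the file first.
    df = pd.read_parquet("s3://my-bucket/data/part-0000.parquet")

    # One JSON object per line (JSON Lines), which streams well for large output.
    df.to_json("part-0000.json", orient="records", lines=True)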