Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
6
votes
1 answer

Write nested parquet format from Python

I have a flat parquet file where one varchar column stores JSON data as a string, and I want to transform this data into a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any…
Stephan Claus
  • 405
  • 1
  • 6
  • 16
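A minimal sketch of one way to do this, assuming pyarrow is available and the JSON schema is known: parse the varchar column with the standard json module and let pyarrow turn the resulting dicts into a struct-typed column before writing. The column names (raw_json, payload) and the sample data are placeholders.

    import json
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Flat frame where one varchar column holds JSON strings (placeholder data).
    df = pd.DataFrame({
        "id": [1, 2],
        "raw_json": ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}'],
    })

    # Parse each JSON string into a Python dict; pyarrow infers a struct type.
    parsed = df["raw_json"].map(json.loads)
    table = pa.table({
        "id": df["id"],
        "payload": pa.array(parsed.tolist()),  # struct<a: int64, b: string>
    })

    pq.write_table(table, "nested.parquet")  # nested column in the output file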
6
votes
3 answers

Why does the index name always appear in the parquet file created with pandas?

I am trying to create a parquet file from a pandas dataframe, and even though I delete the index, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set to None. >>> df =…
Jyoti Dhiman
  • 540
  • 2
  • 6
  • 17
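A short sketch of the usual workaround, assuming either pyarrow or fastparquet is installed as the engine: skip the index entirely with index=False, or reset it and clear its name before writing.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    # Option 1: do not write the index at all.
    df.to_parquet("no_index.parquet", index=False)

    # Option 2: keep a plain, unnamed RangeIndex and write normally.
    df = df.reset_index(drop=True)
    df.index.name = None
    df.to_parquet("plain_index.parquet")

    print(pd.read_parquet("no_index.parquet").index.name)  # None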
6
votes
3 answers

How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet

I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories. For example: import fastparquet pfile =…
Tim Morton
  • 240
  • 1
  • 3
  • 11
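One commonly suggested pattern, sketched here with hypothetical directory names: pass dask.dataframe.read_parquet a list of paths or glob patterns instead of a single directory, so all pieces land in one dataframe.

    import dask.dataframe as dd

    # Hypothetical directories, each holding parquet files with the same schema.
    paths = [
        "data/2021/part.*.parquet",
        "data/2022/part.*.parquet",
    ]

    # read_parquet accepts a list of paths/globs and concatenates the partitions.
    ddf = dd.read_parquet(paths, engine="fastparquet")
    print(ddf.npartitions)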
5
votes
2 answers

Reading Parquet File with Array<Map<String, String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
  • 794
  • 2
  • 9
  • 23
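A rough sketch of the workaround often used for list-of-map/struct columns, assuming pyarrow is installed: read with the pyarrow engine, which round-trips nested types that fastparquet may not handle. The path and the 'events' column name are placeholders.

    import dask.dataframe as dd

    # Hypothetical path to the PySpark-written dataset; 'events' is the
    # array<map<string,string>> (list of dicts) column.
    ddf = dd.read_parquet("spark_output/", engine="pyarrow")

    # Each cell of 'events' comes back as a list of dicts in pandas.
    print(ddf["events"].head())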
5
votes
1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G, 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
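A minimal sketch of the general shape of that pipeline, with placeholder paths and sizes; the memory-relevant knobs are the blocksize used when reading the JSON-lines dump and the shuffle triggered by set_index.

    import dask.dataframe as dd

    # Read the JSON-lines dump in modest blocks so no single partition is huge.
    ddf = dd.read_json("cirrussearch-dump.json", lines=True, blocksize=2**28)

    # Sorting by 'title' requires a shuffle; doing it once here means the
    # resulting parquet dataset is already indexed for later reads.
    ddf = ddf.set_index("title")

    ddf.to_parquet("wiki_parquet/", engine="fastparquet")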
5
votes
1 answer

How to write a Dask dataframe containing a column of arrays to a parquet file

I have a Dask dataframe, one column of which contains a numpy array of floats: import dask.dataframe as dd import pandas as pd import numpy as np df = dd.from_pandas( pd.DataFrame( { 'id':range(1, 6), …
junichiro
  • 5,282
  • 3
  • 18
  • 26
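One approach that is often suggested, sketched with hypothetical column names and assuming a dask version whose pyarrow writer accepts an explicit schema: convert the numpy arrays to plain lists and declare the column as a parquet list type.

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    import pyarrow as pa

    pdf = pd.DataFrame({
        "id": range(1, 6),
        "vec": [np.random.rand(3) for _ in range(5)],  # column of float arrays
    })
    # Arrow handles plain lists inside object cells more predictably than
    # numpy arrays, so convert before writing.
    pdf["vec"] = pdf["vec"].map(list)

    ddf = dd.from_pandas(pdf, npartitions=2)

    # Explicit schema: 'vec' is written as list<double> in the parquet file.
    schema = pa.schema([("id", pa.int64()), ("vec", pa.list_(pa.float64()))])
    ddf.to_parquet("arrays_parquet/", engine="pyarrow", schema=schema)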
4
votes
1 answer

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory. Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype…
Kermit
  • 4,922
  • 4
  • 42
  • 74
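A sketch of what partial reads look like with fastparquet, assuming the file was written with several row groups; only the requested columns and row groups are decoded rather than the whole file. The file and column names are placeholders.

    from fastparquet import ParquetFile

    pf = ParquetFile("big.parquet")  # reads only the footer/metadata

    # Pull a single column; other columns are never decompressed.
    prices = pf.to_pandas(columns=["price"])

    # Or stream one row group at a time to bound memory.
    for chunk in pf.iter_row_groups(columns=["price"]):
        print(chunk["price"].sum())  # placeholder per-chunk work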
4
votes
1 answer

Read/Write Parquet with Struct column type

I am trying to write a Dataframe like this to Parquet: | foo | bar | |-----|-------------------| | 1 | {"a": 1, "b": 10} | | 2 | {"a": 2, "b": 20} | | 3 | {"a": 3, "b": 30} | I am doing it with Pandas and Fastparquet: df =…
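A small sketch of the pyarrow-based route, since fastparquet has historically had limited struct support; the column names follow the question, and this assumes pyarrow is installed.

    import pandas as pd

    df = pd.DataFrame({
        "foo": [1, 2, 3],
        "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}],
    })

    # pyarrow maps the dict column to a parquet struct<a: int64, b: int64>.
    df.to_parquet("struct.parquet", engine="pyarrow")

    back = pd.read_parquet("struct.parquet", engine="pyarrow")
    print(back["bar"][0])  # {'a': 1, 'b': 10}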
4
votes
0 answers

Unable to read parquet file with fastparquet but works with pyarrow - nullable ints

Currently running some code like this: df = pd.read_parquet('/tmp/my-file.parquet', engine='pyarrow') I was having memory consumption issues since the files are large, so I wanted to investigate whether fastparquet would work better for memory…
JD D
  • 7,398
  • 2
  • 34
  • 53
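A hedged sketch of one possible workaround, assuming the failure really does come from pandas nullable Int64 columns written by pyarrow: cast them to float (so nulls become NaN) before producing the file that fastparquet has to read. Whether this matches the actual file is an assumption.

    import pandas as pd

    df = pd.DataFrame({"count": pd.array([1, None, 3], dtype="Int64")})

    # Nullable Int64 -> float64: missing values become NaN, which fastparquet
    # reads without needing nullable-int support.
    df["count"] = df["count"].astype("float64")
    df.to_parquet("/tmp/my-file.parquet", engine="pyarrow")

    back = pd.read_parquet("/tmp/my-file.parquet", engine="fastparquet")
    print(back.dtypes)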
4
votes
2 answers

Python Pandas to convert CSV to Parquet using Fastparquet

I am using the Python 3.6 interpreter in my PyCharm venv and trying to convert a CSV to Parquet. import pandas as pd df = pd.read_csv('/parquet/drivers.csv') df.to_parquet('output.parquet') Error-1 ImportError: Unable to find a usable engine;…
Himalay Majumdar
  • 3,883
  • 14
  • 65
  • 94
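The usual shape of the fix, sketched here: install one of the two engines into the same venv and name it explicitly so pandas does not have to guess which one to use.

    # pip install fastparquet   (run inside the same PyCharm venv)
    import pandas as pd

    df = pd.read_csv("/parquet/drivers.csv")
    df.to_parquet("output.parquet", engine="fastparquet")  # or engine="pyarrow"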
4
votes
3 answers

dask read_parquet with pyarrow memory blow up

I am using dask to write and read parquet. I am writing using the fastparquet engine and reading using the pyarrow engine. My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the…
pranav kohli
  • 123
  • 2
  • 6
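A sketch of the mitigations usually tried first, with placeholder paths: keep reading with the engine whose memory profile was acceptable, or (assuming a dask version that supports it) split large row groups into smaller partitions so no single task materialises an oversized chunk.

    import dask.dataframe as dd

    # Option 1: stay on the engine that fit in the 1 GB worker.
    ddf = dd.read_parquet("data/*.parquet", engine="fastparquet")

    # Option 2: with pyarrow, split each row group into its own partition to
    # keep per-task memory small (split_row_groups is an assumption about the
    # dask version in use).
    ddf = dd.read_parquet("data/*.parquet", engine="pyarrow", split_row_groups=True)

    print(ddf.npartitions)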
4
votes
0 answers

Optimal approach to create dask dataframe from parquet files (HDFS) in different directories

I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time. Approach 1: call the read_parquet API with a glob path.…
Santosh Kumar
  • 761
  • 5
  • 28
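A third, delayed-based sketch, assuming an fsspec/pyarrow HDFS driver is configured so pandas can read hdfs:// paths directly; each directory becomes one lazy read and nothing loads until the graph runs. The paths are placeholders.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    dirs = ["hdfs:///logs/app1/", "hdfs:///logs/app2/"]  # hypothetical directories

    # One lazy pandas read per directory, stitched into a single dask dataframe.
    parts = [dask.delayed(pd.read_parquet)(d) for d in dirs]
    ddf = dd.from_delayed(parts)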
3
votes
0 answers

Why does pandas.to_parquet need so much RAM?

Wisdom of such a use case aside... how come a machine with 512 GB of RAM (and nothing else running) runs out of memory while trying to save a pandas df (df.to_parquet(...)) that has an object size (sys.getsizeof) of "only" ~25 GB? The df is ~73M rows by 2…
Tim
  • 236
  • 2
  • 8
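A hedged sketch of one mitigation, assuming the frame can be written incrementally: write it in slices with fastparquet's append mode so the writer never holds conversion buffers for the whole ~73M-row frame at once. The chunk size is a placeholder.

    import pandas as pd
    import fastparquet

    def write_in_chunks(df: pd.DataFrame, path: str, chunk_rows: int = 5_000_000):
        """Write df to a single parquet file one slice (row group) at a time."""
        for start in range(0, len(df), chunk_rows):
            chunk = df.iloc[start:start + chunk_rows]
            # The first slice creates the file; later slices append row groups.
            fastparquet.write(path, chunk, append=start > 0)

    # write_in_chunks(df, "big.parquet")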
3
votes
0 answers

Unable to export dataframe due to OverflowError: Python int too large to convert to C long

I am trying to export a pandas dataframe as a parquet file. This dataframe has a memory usage of 4GB+ with 76 million rows and 6 columns (int64(3) columns, object(3) columns). When I write this out as a parquet file, I am getting an OverflowError:…
veg2020
  • 956
  • 10
  • 27
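A diagnostic sketch under the assumption that the error comes from an object column holding Python ints above the int64 range; finding the offending column and casting it (to string here) usually unblocks the write. The helper name is hypothetical.

    import pandas as pd

    INT64_MAX = 2**63 - 1

    def columns_with_huge_ints(df: pd.DataFrame):
        """Report object columns containing ints that do not fit in int64."""
        bad = []
        for col in df.select_dtypes(include="object"):
            if df[col].map(lambda v: isinstance(v, int) and abs(v) > INT64_MAX).any():
                bad.append(col)
        return bad

    # for col in columns_with_huge_ints(df):
    #     df[col] = df[col].astype(str)  # or rescale/split the values
    # df.to_parquet("out.parquet", engine="fastparquet")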
3
votes
1 answer

How to convert parquet to json

I have parquet files hosted on S3 that I want to download and convert to JSON. I was able to use select_object_content to output certain files as JSON using SQL in the past. I need to find a faster way to do it because it is timing out for larger…
bnykpp
  • 55
  • 1
  • 5
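A small local-conversion sketch, assuming s3fs is installed for direct S3 reads (or that the file has already been downloaded) and that each file fits in memory; the paths are placeholders.

    import pandas as pd

    # Requires s3fs for direct S3 reads; otherwise download the file first.
    df = pd.read_parquet("s3://my-bucket/data/part-0000.parquet")

    # One JSON object per line (JSON Lines), which streams well for large output.
    df.to_json("part-0000.json", orient="records", lines=True)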