A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
0 votes · 1 answer
Dask dataframe read parquet format fails from http
I have been dealing with this problem for a week.
I use the command
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
I get an "invalid parquet magic" error.
However, ddf.read_parquet works fine with "webhdfs://".
I would…
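The "invalid parquet magic" error means the bytes fetched over plain HTTP are not a Parquet file; a webhdfs REST endpoint often returns a JSON or HTML response (a redirect or an error page) rather than the raw file content. A minimal stdlib sketch of the check a Parquet reader performs; the file names and helper here are illustrative, not part of dask or fastparquet:

```python
def has_parquet_magic(path):
    """A Parquet file starts and ends with the 4-byte magic b'PAR1'."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to 4 bytes before end of file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# A JSON error page saved with a .parquet name fails the check:
with open("fake.parquet", "wb") as f:
    f.write(b'{"RemoteException": "..."}')
print(has_parquet_magic("fake.parquet"))  # False
```

If this check fails on the downloaded bytes, the problem is the transport (what the URL actually returns), not the Parquet reader.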
0 votes · 1 answer
Is it possible to store a parquet file on disk, while appending, and also retrieving rows by index?
I have 185 files of data, containing a total of 30 million rows. Each row has two columns: a single int which I want to use as an index, and a list of 512 ints.
So it looks something like this
IndexID Ids
1899317 [0, 47715, 1757, 9,…

SantoshGupta7 (5,607)
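Parquet files are not appendable in place, and neither fastparquet nor pyarrow offers random row lookup by value. A common workaround is to keep rows sorted by the index column across files and maintain a small lookup table from each file's first index to its name, so a query opens only the one file that can contain the row. A pure-Python sketch of that range lookup; the file names and boundaries are made up:

```python
import bisect

# Sorted first-index of each data file, one entry per file (made-up values).
file_starts = [0, 1_000_000, 2_000_000]
file_names = ["part-000.parquet", "part-001.parquet", "part-002.parquet"]

def file_for_index(idx):
    """Return the file whose index range contains idx."""
    pos = bisect.bisect_right(file_starts, idx) - 1
    return file_names[pos]

print(file_for_index(1_899_317))  # part-001.parquet
```

The same idea extends to row groups within a file: Parquet stores per-row-group min/max statistics, so a reader can skip row groups whose index range excludes the requested value.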
0 votes · 1 answer
dask: read parquet from Azure blob - AzureHttpError
I created a parquet file in an Azure blob using dask.dataframe.to_parquet (Moving data from a database to Azure blob storage).
I would now like to read that file. I'm doing:
STORAGE_OPTIONS={'account_name': 'ACCOUNT_NAME',
…

Ray Bell (1,508)
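With dask, Azure Blob credentials are passed to read_parquet through storage_options; an AzureHttpError typically points at a wrong container/path or a missing account key rather than a Parquet problem. A configuration sketch with placeholder credentials (the account name and key are not real, and the read itself needs dask plus an Azure filesystem backend installed):

```python
# Placeholder credentials: replace with your real account name and key.
STORAGE_OPTIONS = {
    "account_name": "ACCOUNT_NAME",
    "account_key": "ACCOUNT_KEY",
}

# With dask and an Azure filesystem backend installed, the read would
# look roughly like this (path is illustrative):
# import dask.dataframe as dd
# df = dd.read_parquet("abfs://container/path/data.parquet",
#                      storage_options=STORAGE_OPTIONS)
print(sorted(STORAGE_OPTIONS))  # ['account_key', 'account_name']
```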
0 votes · 1 answer
Dask not recovering partitions from simple (non-Hive) Parquet files
I have a two-part question about Dask+Parquet. I am trying to run queries on a dask dataframe created from a partitioned Parquet file, like so:
import pandas as pd
import dask.dataframe as dd
import fastparquet
##### Generate random data to Simulate…

hda 2017 (59)
0 votes · 2 answers
Loading parquet file to Redshift
I am trying to save dataframes to parquet and then load them into Redshift.
To do that, I do the following:
parquet_buffer =…

FrankyBravo (438)
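Redshift cannot ingest Parquet from an in-memory buffer through an INSERT; the supported route is to write the Parquet file to S3 and issue a COPY with FORMAT AS PARQUET. A sketch that builds such a statement; the table name, bucket path, and IAM role below are placeholders:

```python
def redshift_copy_sql(table, s3_path, iam_role):
    """Build a COPY statement that loads a Parquet file from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )

sql = redshift_copy_sql(
    "my_table",
    "s3://my-bucket/data/part-0.parquet",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(sql)
```

The statement would then be executed against Redshift through whatever database driver the project already uses.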
0 votes · 1 answer
Google bigquery - Error message 'DataFrame' object has no attribute 'to_parquet' whereas pyarrow and fastparquet are installed
I'm trying to use the Google bigquery function load_table_from_dataframe but I get an error message stating that DataFrame object has no attribute to_parquet.
I have installed both pyarrow and fastparquet but still getting the same error…

CharlotteB (1)
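DataFrame.to_parquet was added in pandas 0.21.0, so this AttributeError usually means the environment is running an older pandas, regardless of pyarrow or fastparquet being installed. A small stdlib sketch of the version comparison (no pandas import needed to illustrate it):

```python
def version_tuple(v):
    """'0.20.3' -> (0, 20, 3), for comparing release versions."""
    return tuple(int(part) for part in v.split("."))

# DataFrame.to_parquet first appeared in pandas 0.21.0, so e.g.:
print(version_tuple("0.20.3") < version_tuple("0.21.0"))  # True
```

Checking pandas.__version__ in the failing environment, and upgrading if it predates 0.21.0, is the usual fix.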
0 votes · 0 answers
How to persist kdb tables to compressed parquet?
I'm trying to store/persist kdb tables in compressed Apache Parquet format.
My initial plan is basically to use embedPy to convert either fastparquet or pyarrow.parquet to be usable from within q.
I'll then use the kdb+ tick architecture to process…

Natalie Williams (355)
0 votes · 0 answers
Symbol not found: _PyClass_Type
I'm trying to run some tests from fastparquet using PyCharm on macOS Sierra (10.12.6), but they keep failing with:
ImportError: dlopen(/Users/dhaviv/Documents/GitHub/fastparquet/fastparquet/speedups.so, 2): Symbol not found: _PyClass_Type
I've installed…

Daniel Haviv (1,036)
0 votes · 1 answer
Converting NaN floats to other types in Parquet format
I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of…

Eumcoz (2,388)
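The NaNs usually come from pandas itself: Parquet integer columns can carry nulls, but a pandas int column with missing values is promoted to float, with NaN as the missing marker. A stdlib sketch of mapping such floats back to optional ints (the column values are illustrative):

```python
import math

def float_to_optional_int(values):
    """Map floats (NaN = missing) back to ints, with None for missing."""
    return [None if (v is None or math.isnan(v)) else int(v) for v in values]

print(float_to_optional_int([1.0, float("nan"), 3.0]))  # [1, None, 3]
```

In pandas itself the equivalent is a nullable integer dtype (e.g. "Int64"), which round-trips through Parquet without the float promotion.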
0 votes · 1 answer
Skip metadata for large binary fields in fastparquet
If a dataset has a column with large binary data (e.g. an image or a sound-wave data) then computing min/max statistics for that column becomes costly both in compute and storage requirements, despite being completely useless (querying these values…

stav (1,497)
0 votes · 1 answer
How to pass data generated by a Databricks notebook to a Python step?
I am building an Azure Data Factory v2, which comprises
A Databricks step to query large tables from Azure Blob storage and generate a tabular result intermediate_table;
A Python step (which does several things and would be cumbersome to put in a…

Davide Fiocco (5,350)
0 votes · 0 answers
Cannot import fastparquet into Python notebook
I am trying to install fastparquet in order to convert a pandas dataframe into a parquet file. But when I run pip install fastparquet, I get the following:
Requirement already satisfied: fastparquet in…

Avantika Banerjee (307)
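"Requirement already satisfied" from pip while the notebook still cannot import the package usually means pip and the notebook kernel are using different Python environments. A stdlib check to run inside the notebook (no fastparquet needed to demonstrate the technique):

```python
import importlib.util
import sys

# The interpreter this kernel runs on; pip must install into this same
# environment, e.g. via:  python -m pip install fastparquet
print(sys.executable)

# Whether a module is importable from this environment (shown here with
# a stdlib module; substitute "fastparquet" in the notebook):
print(importlib.util.find_spec("json") is not None)  # True
```

If find_spec("fastparquet") returns None while pip reports it installed, the two are pointing at different environments, and running "python -m pip install" with the kernel's interpreter fixes the mismatch.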
0 votes · 2 answers
How can Athena read a parquet file from an S3 bucket?
I am porting a Python project (S3 + Athena) from CSV to parquet.
I can make the parquet file, which can be viewed by Parquet View.
I can upload the file to s3 bucket.
I can create the Athena table pointing to the s3 bucket.
However, when I…

kzfid (688)
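Athena reads Parquet through an external table declared STORED AS PARQUET whose LOCATION points at the S3 prefix (a folder, not the single file). A sketch that builds such a DDL statement; the table name, columns, and bucket below are placeholders:

```python
def athena_parquet_ddl(table, columns, s3_prefix):
    """Build a CREATE EXTERNAL TABLE statement for Parquet data in S3."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_prefix}';"
    )

ddl = athena_parquet_ddl(
    "my_table",
    [("id", "bigint"), ("name", "string")],
    "s3://my-bucket/data/",
)
print(ddl)
```

A common pitfall matching this question is pointing LOCATION at the file itself, or declaring column types that disagree with the Parquet schema; both make queries return empty or fail.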
0 votes · 1 answer
Unable to read parquet file, giving Gzip code failed error
I am trying to convert a parquet file to CSV with pyarrow.
df = pd.read_parquet('test.parquet')
The above code works fine with the sample parquet files downloaded from GitHub.
But when I try with the actual large parquet file, it is giving the…

Pri31 (447)
0 votes · 1 answer
Is it a bug in the fastparquet module?
I am using an AWS SageMaker Jupyter notebook and getting the following error:
in ()
1 import s3fs
----> 2 import fastparquet as fp
3 s3 = s3fs.S3FileSystem()
4 fs = s3fs.core.S3FileSystem()
5…

Pol99 (111)