Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
0
votes
1 answer

Dask dataframe read parquet format fails from http

I have been dealing with this problem for a week. I use the command from dask import dataframe as ddf; ddf.read_parquet("http://IP:port/webhdfs/v1/user/...") and get an "invalid parquet magic" error. However, ddf.read_parquet is OK with "webhdfs://". I would…
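A minimal sketch of the working variant the asker mentions, with hypothetical host and path values: the raw WebHDFS REST URL over plain http returns JSON metadata unless the right operation parameters are appended, so the reader sees non-parquet bytes ("invalid parquet magic"), whereas fsspec's webhdfs:// protocol issues those requests internally.

```python
import dask.dataframe as dd

# The webhdfs:// protocol (handled by fsspec) performs the WebHDFS REST
# handshake and gives the parquet reader real file access, unlike a bare
# http URL pointing at the /webhdfs/v1/ endpoint.
df = dd.read_parquet(
    "webhdfs://IP:port/user/path/to/data.parquet",  # hypothetical path
    engine="fastparquet",
)
```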
0
votes
1 answer

Is it possible to store a parquet file on disk, while appending, and also retrieving rows by index?

I have 185 files of data, which contain a total of 30 million rows. Each row has two columns: a single int, which I want to use as an index, and a list of 512 ints. So it looks something like this IndexID Ids 1899317 [0, 47715, 1757, 9,…
SantoshGupta7
  • 5,607
  • 14
  • 58
  • 116
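A hedged sketch of one way to get both behaviours with fastparquet, using a hypothetical output path and a placeholder iterable for the 185 pieces: append each file as new row groups, and rely on the per-row-group min/max statistics kept for the index column so readers can skip row groups when looking up rows by index.

```python
import os
import pandas as pd
import fastparquet

path = "dataset.parq"  # hypothetical output location
for df in dataframes:  # placeholder iterable yielding the 185 DataFrames
    # file_scheme="hive" writes a directory dataset; append=True adds the
    # new rows as extra row groups instead of rewriting the file.
    fastparquet.write(path, df, file_scheme="hive",
                      append=os.path.exists(path))
```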
0
votes
1 answer

dask: read parquet from Azure blob - AzureHttpError

I created a parquet file in an Azure blob using dask.dataframe.to_parquet (Moving data from a database to Azure blob storage). I would now like to read that file. I'm doing: STORAGE_OPTIONS={'account_name': 'ACCOUNT_NAME', …
Ray Bell
  • 1,508
  • 4
  • 18
  • 45
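A sketch of the read side under the adlfs/fsspec convention, with hypothetical account and container names: the same storage_options used when writing must be passed back to read_parquet, along with the abfs:// protocol prefix.

```python
import dask.dataframe as dd

# Hypothetical credentials and paths throughout.
STORAGE_OPTIONS = {"account_name": "ACCOUNT_NAME",
                   "account_key": "ACCOUNT_KEY"}
df = dd.read_parquet("abfs://container/folder/data.parquet",
                     storage_options=STORAGE_OPTIONS,
                     engine="fastparquet")
```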
0
votes
1 answer

Dask not recovering partitions from simple (non-Hive) Parquet files

I have a two-part question about Dask+Parquet. I am trying to run queries on a dask dataframe created from a partitioned Parquet file, like so: import pandas as pd import dask.dataframe as dd import fastparquet ##### Generate random data to Simulate…
hda 2017
  • 59
  • 6
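For contrast with the non-Hive case in the question, a small sketch (hypothetical column names) of the Hive-style layout Dask does recover partitions from: partition_on writes key=value subdirectories, and filters on that column prune whole directories at read time.

```python
import pandas as pd
import dask.dataframe as dd

# Write a dataset partitioned on "year" (Hive-style year=... directories).
df = dd.from_pandas(
    pd.DataFrame({"year": [2015, 2016, 2017, 2018] * 25,
                  "value": range(100)}),
    npartitions=4)
df.to_parquet("out_parq", engine="fastparquet", partition_on=["year"])

# Reading with a filter on the partition column skips the other directories.
ddf = dd.read_parquet("out_parq", engine="fastparquet",
                      filters=[("year", "==", 2017)])
```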
0
votes
2 answers

Loading parquet file to Redshift

I am trying to save dataframes to parquet and then load them into Redshift. For that I do the following: parquet_buffer =…
FrankyBravo
  • 438
  • 1
  • 4
  • 12
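One common route, sketched with hypothetical bucket, table, and IAM role names: land the parquet bytes on S3 first, then let Redshift's COPY ingest them natively with FORMAT AS PARQUET.

```python
import io
import boto3
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Serialize to an in-memory buffer and upload it to S3.
parquet_buffer = io.BytesIO()
df.to_parquet(parquet_buffer)
boto3.client("s3").put_object(Bucket="my-bucket", Key="data.parquet",
                              Body=parquet_buffer.getvalue())

# Redshift reads the parquet file directly; run this via your Redshift
# connection of choice.
copy_stmt = """
    COPY my_table FROM 's3://my-bucket/data.parquet'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;
"""
```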
0
votes
1 answer

Google BigQuery - Error message 'DataFrame' object has no attribute 'to_parquet' even though pyarrow and fastparquet are installed

I'm trying to use the Google BigQuery function load_table_from_dataframe, but I get an error message stating that the DataFrame object has no attribute to_parquet. I have installed both pyarrow and fastparquet but am still getting the same error…
CharlotteB
  • 1
  • 1
  • 2
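A quick check worth running first: DataFrame.to_parquet only exists from pandas 0.21 onward, so the AttributeError points at the pandas version rather than at the parquet engines.

```python
import pandas as pd

# If this prints something older than 0.21, to_parquet is simply absent,
# no matter which parquet engines are installed.
print(pd.__version__)
# Fix: pip install --upgrade pandas
```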
0
votes
0 answers

How to persist kdb tables to compressed parquet?

I'm trying to store/persist kdb tables in compressed Apache Parquet format. My initial plan is basically to use embedPy to make either fastparquet or pyarrow.parquet usable from within q. I'll then use the kdb+ tick architecture to process…
Natalie Williams
  • 355
  • 1
  • 3
  • 9
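A minimal sketch of the Python half that embedPy would call, with hypothetical function and path names: the kdb table, marshalled into a pandas DataFrame, is written as snappy-compressed parquet.

```python
import pandas as pd
import fastparquet

def persist_table(df: pd.DataFrame, path: str) -> None:
    # Snappy gives fast, lightweight compression; fastparquet also accepts
    # "GZIP" and others if smaller files matter more than speed.
    fastparquet.write(path, df, compression="SNAPPY")
```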
0
votes
0 answers

Symbol not found: _PyClass_Type

I'm trying to run some tests from fastparquet using PyCharm on macOS Sierra (10.12.6) but keep failing on: ImportError: dlopen(/Users/dhaviv/Documents/GitHub/fastparquet/fastparquet/speedups.so, 2): Symbol not found: _PyClass_Type I've installed…
Daniel Haviv
  • 1,036
  • 8
  • 16
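One plausible diagnosis, offered as an assumption: _PyClass_Type exists only in CPython 2, so a speedups.so referencing it was built against Python 2 headers while a Python 3 interpreter (or vice versa) is loading it. A small check, with the usual rebuild as a comment:

```python
import sys

# Confirm which interpreter PyCharm's run configuration actually uses;
# a build/run version mismatch produces exactly this dlopen failure.
print(sys.executable, sys.version)
# Rebuilding the extension against that interpreter usually resolves it:
#   pip install --no-cache-dir --force-reinstall fastparquet
```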
0
votes
1 answer

Converting NaN floats to other types in Parquet format

I am currently processing a bunch of CSV files and transforming them into Parquet. I use these with Hive and query the files directly. I would like to switch over to Dask for my data processing. The data I am reading has optional columns, some of…
Eumcoz
  • 2,388
  • 1
  • 21
  • 44
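A sketch assuming a reasonably recent pandas/fastparquet pair and a hypothetical column name: pandas' nullable Int64 extension dtype keeps optional integer columns integral instead of upcasting them to float64 with NaN, and that dtype carries through to the parquet output.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer extension dtype: missing
# values become pd.NA rather than forcing the column to float.
df = pd.read_csv("input.csv", dtype={"optional_col": "Int64"})  # hypothetical
df.to_parquet("output.parquet", engine="fastparquet")
```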
0
votes
1 answer

Skip metadata for large binary fields in fastparquet

If a dataset has a column with large binary data (e.g. an image or a sound-wave data) then computing min/max statistics for that column becomes costly both in compute and storage requirements, despite being completely useless (querying these values…
stav
  • 1,497
  • 2
  • 15
  • 40
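An assumption to check against the installed version: newer fastparquet releases accept a stats argument to write() naming the columns for which min/max statistics are computed, which would let a large binary column simply be left off the list.

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"id": [1, 2],
                   "blob": [b"\x00" * 1024, b"\xff" * 1024]})
# Assumed API: stats takes the list of columns to compute statistics for,
# so "blob" gets no (costly, useless) min/max metadata.
fastparquet.write("data.parq", df, stats=["id"])
```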
0
votes
1 answer

How to pass data generated by a Databricks notebook to a Python step?

I am building an Azure Data Factory v2 pipeline, which comprises: a Databricks step to query large tables from Azure Blob storage and generate a tabular result, intermediate_table; a Python step (which does several things and would be cumbersome to put in a…
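One plausible hand-off, sketched with hypothetical container and credential names: the Databricks step writes intermediate_table to Blob storage as parquet, and the Python step reads it back from the same location.

```python
import dask.dataframe as dd

# The Python activity reads the intermediate parquet result the Databricks
# step left in Blob storage; names and keys are placeholders.
df = dd.read_parquet(
    "abfs://container/intermediate_table",
    storage_options={"account_name": "ACCOUNT_NAME",
                     "account_key": "ACCOUNT_KEY"},
).compute()
```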
0
votes
0 answers

Cannot import fastparquet into Python notebook

I am trying to install fastparquet in order to convert a pandas dataframe into a parquet file. But even though I get the following when I run pip install fastparquet: Requirement already satisfied: fastparquet in…
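"Requirement already satisfied" from pip combined with an ImportError in the notebook usually means pip targeted a different interpreter than the one the notebook kernel runs. A quick check:

```python
import sys

# The interpreter the notebook kernel actually uses; install into this one
# rather than whichever `pip` happens to be first on PATH.
print(sys.executable)
# In a notebook cell: !{sys.executable} -m pip install fastparquet
```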
0
votes
2 answers

How can Athena read a parquet file from an S3 bucket?

I am porting a Python project (S3 + Athena) from CSV to Parquet. I can make the parquet file, which can be viewed by Parquet View. I can upload the file to an S3 bucket. I can create the Athena table pointing to the S3 bucket. However, when I…
kzfid
  • 688
  • 3
  • 10
  • 17
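A sketch of the DDL side, with hypothetical table, column, and bucket names: the Athena table has to be declared STORED AS PARQUET, because a table created with the default text/CSV serde will misread parquet files even though they upload cleanly.

```python
# Run this DDL in the Athena console or via boto3's start_query_execution.
ddl = """
CREATE EXTERNAL TABLE my_table (
    id    bigint,
    value double
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-prefix/';
"""
```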
0
votes
1 answer

Unable to read parquet file, giving Gzip code failed error

I am trying to convert a parquet file to CSV with pyarrow. df = pd.read_parquet('test.parquet') The above code works fine with the sample parquet files downloaded from GitHub, but when I try it with the actual large parquet file, it gives the…
Pri31
  • 447
  • 1
  • 5
  • 9
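A narrowing step rather than a fix, with a hypothetical file name: reading the same file with each engine separates a truncated or corrupted file (both fail) from a codec-support gap in one engine (only one fails).

```python
import pandas as pd

# If both raise on decompression, suspect the file (truncated upload,
# interrupted write); if only one raises, suspect that engine's codec
# support in the current environment.
df = pd.read_parquet("large_file.parquet", engine="pyarrow")
df = pd.read_parquet("large_file.parquet", engine="fastparquet")
```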
0
votes
1 answer

Is it a bug in the fastparquet module?

I am using an AWS SageMaker Jupyter notebook and getting the following error: in () 1 import s3fs ----> 2 import fastparquet as fp 3 s3 = s3fs.S3FileSystem() 4 fs = s3fs.core.S3FileSystem() 5…