Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
1 vote • 0 answers

Spark Dataframe - java.lang.ArrayIndexOutOfBoundsException: 26 while writing to S3

I am trying to read some data from a Snowflake table using the Spark-Snowflake connector and write the data to S3 after performing some transformations, but I am seeing the error java.lang.ArrayIndexOutOfBoundsException: 26 when the line…
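
For orientation, a minimal PySpark sketch of the flow the question describes (read via the spark-snowflake connector, transform, write Parquet to S3); the connection options, table name, and bucket path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("snowflake-to-s3").getOrCreate()

    # Placeholder connection options for the spark-snowflake connector.
    sf_options = {
        "sfURL": "myaccount.snowflakecomputing.com",
        "sfUser": "user",
        "sfPassword": "password",
        "sfDatabase": "db",
        "sfSchema": "schema",
        "sfWarehouse": "wh",
    }

    # Read the Snowflake table into a DataFrame.
    df = (spark.read
          .format("net.snowflake.spark.snowflake")
          .options(**sf_options)
          .option("dbtable", "MY_TABLE")
          .load())

    # ... transformations go here ...

    # Write the result out to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
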
1 vote • 1 answer

How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?

Why do int96 timestamps not work for me? I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated. What did I…
Faber • 1,504 • 2 • 13 • 21
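
The Firehose side is configuration, but the int96-versus-int64 distinction itself is easy to see with pyarrow, where the writer flag is explicit; a small sketch (file names are arbitrary):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(
        pd.DataFrame({"ts": [pd.Timestamp("2021-01-01 12:00:00")]}))

    # Legacy behaviour: timestamps stored as the deprecated int96 type.
    pq.write_table(table, "ts_int96.parquet",
                   use_deprecated_int96_timestamps=True)

    # int64 timestamps (truncated here to millisecond precision), which
    # S3 Select and most modern readers understand.
    pq.write_table(table, "ts_int64.parquet",
                   use_deprecated_int96_timestamps=False,
                   coerce_timestamps="ms",
                   allow_truncated_timestamps=True)
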
1 vote • 0 answers

Retain pandas dtype 'category' when using parquet file

I am using parquet to store pandas dataframes, and would like to keep the dtype of columns. However, it sometimes isn't working; here is the example code: import pandas as pd import numpy as np df = pd.DataFrame({ 'a': [pd.NA, 'a', 'b', 'c'], …
Leo • 1,176 • 1 • 13 • 33
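
A minimal round trip with the pyarrow engine for reference; whether the dtype survives can depend on the pandas/pyarrow versions and on how the column was built:

    import pandas as pd

    df = pd.DataFrame({"a": [pd.NA, "a", "b", "c"]})
    df["a"] = df["a"].astype("category")

    # pandas records dtype information in the Parquet metadata when writing
    # with pyarrow, so the categorical normally survives the round trip.
    df.to_parquet("categories.parquet", engine="pyarrow")
    restored = pd.read_parquet("categories.parquet", engine="pyarrow")
    print(restored.dtypes)  # 'a' should come back as category
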
1 vote • 0 answers

Reading parquet files from gs or s3a fails with ClassNotFoundException: org.apache.hadoop.conf.Configuration

I'm using Flink 1.16.0 with Kotlin to read and process (snappy-compressed) parquet files that were generated by Spark, and I keep running into ClassNotFoundException: org.apache.hadoop.conf.Configuration. The files are on Google Cloud Storage/gs://,…
1 vote • 1 answer

How to store and load multi-column index pandas dataframes with parquet

I have a dataset similar to: initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}]) initial_df.set_index(['a', 'b'], inplace=True) I am able to store it…
KerikoN • 26 • 4
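
Using the question's own data, a sketch of the usual round trip; the pyarrow engine stores the MultiIndex in the file's pandas metadata, so read_parquet can restore it without an explicit set_index:

    import pandas as pd

    initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898},
                               {'a': 0, 'b': 1, 'c': 1.88},
                               {'a': 1, 'b': 0, 'c': 108.1},
                               {'a': 1, 'b': 1, 'c': 10.898}])
    initial_df.set_index(['a', 'b'], inplace=True)

    # Write and read back; the index comes back as a MultiIndex.
    initial_df.to_parquet("multi_index.parquet", engine="pyarrow")
    restored = pd.read_parquet("multi_index.parquet", engine="pyarrow")
    print(restored.index.names)  # ['a', 'b']
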
1 vote • 1 answer

Can a parquet file exceed 2.1GB?

I'm having an issue storing a large dataset (around 40GB) in a single parquet file. I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a…
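
One way around very large single files, sketched with fastparquet: append chunks to a "hive"-style dataset (a directory of part files) instead of one ever-growing file. The chunk generator below is a made-up stand-in for the question's real data source:

    import pandas as pd
    from fastparquet import write

    def chunks():
        # Hypothetical stand-in for the real ~40 GB data source.
        for i in range(10):
            yield pd.DataFrame({"x": range(i * 1_000, (i + 1) * 1_000)})

    path = "big_dataset.parq"
    for i, chunk in enumerate(chunks()):
        # The first call creates the dataset; later calls add new row groups.
        write(path, chunk, file_scheme="hive", append=(i > 0))
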
1 vote • 1 answer

PyArrow: How to batch data from mongo into partitioned parquet in S3

I want to be able to archive my data from Mongo into S3. Currently, what I do is: (1) read data from Mongo, (2) convert this into a pyarrow Table, (3) write to S3. It works for now, but steps 1 and 2 are a bulk operation where, if the result set is huge, it…
Jiew Meng • 84,767 • 185 • 495 • 805
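
A rough sketch of batching the Mongo cursor instead of materialising everything at once, writing each batch into a partitioned dataset on S3 with pyarrow; the connection details, collection, partition columns, and bucket are all made up, and the documents are assumed to share a schema:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder
    collection = client["mydb"]["mycollection"]                # placeholder

    BATCH_SIZE = 50_000

    def flush(rows):
        # Convert one batch of documents into an Arrow table and append it
        # to the partitioned dataset; s3:// paths are handled by pyarrow's
        # S3 filesystem support.
        table = pa.Table.from_pylist(rows)
        pq.write_to_dataset(table, root_path="s3://my-bucket/archive",
                            partition_cols=["year", "month"])

    batch = []
    for doc in collection.find({}, batch_size=BATCH_SIZE):
        doc.pop("_id", None)  # ObjectId has no direct Arrow equivalent
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)
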
1 vote • 0 answers

How to get a distinct count from a Parquet file without reading the whole file using Java ParquetFileReader

I need to get information such as the maximum length of a string column's values and the distinct count from a Parquet file without reading the whole file, in Java, and without using the Avro Parquet reader. ParquetFileReader parquetFileReader =…
maya • 21 • 1
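
For comparison, this is what the footer statistics expose when read with pyarrow (the Java ParquetFileReader surfaces the same row-group fields): min/max and null count are usually present, while a distinct count is optional and rarely written by producers. Path and column index are placeholders:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("data.parquet").metadata  # placeholder path

    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(0).statistics
        if stats is not None and stats.has_min_max:
            print(rg, stats.min, stats.max,
                  stats.null_count, stats.distinct_count)
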
1 vote • 0 answers

How to null out values in a Parquet file in Scala without Spark

I am looking for a way to read a parquet file, replace values with null in some columns where a condition matches, and write the data back to the original file. Using Spark it's pretty easy, but I want to achieve this without it. This is how I would do…
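
The question asks for Scala, but the general pattern (read, build a mask, swap in a nulled column, rewrite the file) is the same in any library; a pyarrow sketch with made-up column names and condition:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")  # placeholder path
    value_idx = table.schema.get_field_index("value")

    # Rows whose 'value' should become null (condition is hypothetical).
    mask = pc.equal(table["status"], "obsolete")

    # Replace matching rows with typed nulls and swap the column back in.
    nulls = pa.nulls(len(table), type=table["value"].type)
    patched = table.set_column(value_idx, "value",
                               pc.if_else(mask, nulls, table["value"]))

    # Parquet files are immutable, so "writing back" means rewriting the file.
    pq.write_table(patched, "data.parquet")
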
1 vote • 2 answers

Trying to filter in dask.read_parquet tries to compare NoneType and str

I have a project where I pass the following load_args to read_parquet: filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]} According to the documentation, a List[Tuple] like this should be accepted and I should get all…
filpa • 3,651 • 8 • 52 • 91
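
For reference, the filters argument is disjunctive normal form: an outer list of OR groups, each an inner list of AND tuples. A minimal sketch with a placeholder path:

    import dask.dataframe as dd

    filters = [[("itemId", "==", "9403cfde-7fe5-4c9c-916c-41ff0b595c5c")]]

    ddf = dd.read_parquet("s3://my-bucket/data/",  # placeholder path
                          filters=filters)
    print(ddf.head())
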
1 vote • 0 answers

AWS Glue Spark job: Found duplicate column(s) in the data schema and the partition schema: `day`, `month`, `year`

The Glue Spark job is failing with the error message AnalysisException: Found duplicate column(s) in the data schema and the partition schema: day, month, year. My actual Parquet data files in S3 include these partition columns as well. Code…
Raaj • 29 • 2 • 6
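
A sketch of how this layout usually comes about and one way to avoid it: if partitionBy owns year/month/day, Spark stores them only in the directory names and removes them from the file contents, so the data schema and the partition schema no longer overlap (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedupe-partition-columns").getOrCreate()

    # Raw files that still contain year/month/day as ordinary columns.
    df = spark.read.parquet("s3://my-bucket/raw/")

    # partitionBy moves those columns into the directory structure and
    # drops them from the data files themselves.
    (df.write
       .mode("overwrite")
       .partitionBy("year", "month", "day")
       .parquet("s3://my-bucket/clean/"))
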
1 vote • 0 answers

Setting up Bloom filter with PyArrow

I'm writing some datasets to parquet using pyarrow.parquet.write_to_dataset(). Now I'm trying to enable the bloom filter when writing (located in the metadata), but I can find no way to do this. I know in Spark you can do something…
sancholp • 67 • 7
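
The Spark route the question alludes to looks roughly like this: parquet-mr accepts per-column bloom filter settings as writer options. The column name 'key', the ndv value, and the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bloom-filter-write").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/input/")  # placeholder path

    (df.write
       .option("parquet.bloom.filter.enabled#key", "true")
       .option("parquet.bloom.filter.expected.ndv#key", "1000000")
       .parquet("s3://my-bucket/output/"))
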
1 vote • 1 answer

Using Parquet metadata to find a specific key

I have a bunch of Parquet files containing data where each row has the form [key, data1, data2, data3,...]. I need to know in which file a certain key is located, without actually opening each file and searching. Is it possible to get this from the…
sancholp • 67 • 7
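
Row-group statistics in each file's footer can at least narrow the search: min/max bounds on the key column rule files out without reading their data, though a match still has to be confirmed by actually reading. A pyarrow sketch assuming the key is the first column and the files match a glob pattern:

    import glob
    import pyarrow.parquet as pq

    def candidate_files(key, pattern="data/*.parquet"):
        for path in glob.glob(pattern):
            meta = pq.ParquetFile(path).metadata
            for rg in range(meta.num_row_groups):
                stats = meta.row_group(rg).column(0).statistics
                if stats is not None and stats.has_min_max \
                        and stats.min <= key <= stats.max:
                    yield path
                    break

    print(list(candidate_files("some-key")))
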
1 vote • 0 answers

AWS Sagemaker batch transform job with large parquet file and split_type

I'm trying to run a sagemaker batch transform job on a large parquet file (2GB) and I keep having issues with it. In my transformer, I have had to specify split_type='Line' so that I don't get the following error, even when using max_payload=100 Too…
1 vote • 0 answers

Is it possible to write a Parquet file using an OutputStream?

I have an OutputStream and I want to create a Parquet file using this OutputStream. Is it possible to do that?
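
In parquet-mr the writer is built around its own OutputFile/PositionOutputStream abstractions rather than a bare java.io.OutputStream; for comparison, the analogous idea in Python is writing to a file-like object instead of a path:

    import io
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # write_table accepts any binary file-like object, so the Parquet bytes
    # can be produced in memory and then pushed to whatever sink is needed.
    buf = io.BytesIO()
    pq.write_table(table, buf)
    print(len(buf.getvalue()), "bytes of Parquet written to the stream")
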