Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
1 vote • 0 answers

Spark Dataframe - java.lang.ArrayIndexOutOfBoundsException: 26 while writing to S3

I am trying to read some data from a Snowflake table using the Spark-Snowflake connector and write the data to S3 after performing some transformations, but I am seeing the error java.lang.ArrayIndexOutOfBoundsException: 26 when the line…
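
For orientation, a minimal PySpark sketch of the flow the question describes (read via the spark-snowflake connector, transform, write Parquet to S3); the connection options, table name, and bucket path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("snowflake-to-s3").getOrCreate()

    # Placeholder connection options for the spark-snowflake connector.
    sf_options = {
        "sfURL": "myaccount.snowflakecomputing.com",
        "sfUser": "user",
        "sfPassword": "password",
        "sfDatabase": "db",
        "sfSchema": "schema",
        "sfWarehouse": "wh",
    }

    # Read the Snowflake table into a DataFrame.
    df = (spark.read
          .format("net.snowflake.spark.snowflake")
          .options(**sf_options)
          .option("dbtable", "MY_TABLE")
          .load())

    # ... transformations go here ...

    # Write the result out to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
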
1 vote • 1 answer

How can I write Parquet files with int64 timestamps (instead of int96) from AWS Kinesis Firehose?

Why do int96 timestamps not work for me? I want to read the Parquet files with S3 Select. S3 Select does not support timestamps saved as int96 according to the documentation. Also, storing timestamps in parquet as int96 is deprecated. What did I…
Faber • 1,504 • 2 • 13 • 21
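
The Firehose side is configuration, but the int96-versus-int64 distinction itself is easy to see with pyarrow, where the writer flag is explicit; a small sketch (file names are arbitrary):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(
        pd.DataFrame({"ts": [pd.Timestamp("2021-01-01 12:00:00")]}))

    # Legacy behaviour: timestamps stored as the deprecated int96 type.
    pq.write_table(table, "ts_int96.parquet",
                   use_deprecated_int96_timestamps=True)

    # int64 timestamps (truncated here to millisecond precision), which
    # S3 Select and most modern readers understand.
    pq.write_table(table, "ts_int64.parquet",
                   use_deprecated_int96_timestamps=False,
                   coerce_timestamps="ms",
                   allow_truncated_timestamps=True)
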
1 vote • 0 answers

Retain pandas dtype 'category' when using parquet file

I am using parquet to store pandas dataframes, and would like to keep the dtype of columns. However, it sometimes isn't working; here is the example code: import pandas as pd import numpy as np df = pd.DataFrame({ 'a': [pd.NA, 'a', 'b', 'c'], …
Leo • 1,176 • 1 • 13 • 33
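
A minimal round trip with the pyarrow engine for reference; whether the dtype survives can depend on the pandas/pyarrow versions and on how the column was built:

    import pandas as pd

    df = pd.DataFrame({"a": [pd.NA, "a", "b", "c"]})
    df["a"] = df["a"].astype("category")

    # pandas records dtype information in the Parquet metadata when writing
    # with pyarrow, so the categorical normally survives the round trip.
    df.to_parquet("categories.parquet", engine="pyarrow")
    restored = pd.read_parquet("categories.parquet", engine="pyarrow")
    print(restored.dtypes)  # 'a' should come back as category
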
1 vote • 0 answers

Reading parquet files from gs or s3a fails with ClassNotFoundException: org.apache.hadoop.conf.Configuration

I'm using Flink 1.16.0 with Kotlin to read and process (snappy-compressed) parquet files that were generated by Spark, and I keep running into ClassNotFoundException: org.apache.hadoop.conf.Configuration. The files are on Google Cloud Storage/gs://,…
1 vote • 1 answer

How to store and load multi-column index pandas dataframes with parquet

I have a dataset similar to: initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898}, {'a': 0, 'b': 1, 'c': 1.88}, {'a': 1, 'b': 0, 'c': 108.1}, {'a': 1, 'b': 1, 'c': 10.898}]) initial_df.set_index(['a', 'b'], inplace=True) I am able to store it…
KerikoN • 26 • 4
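
Using the question's own data, a sketch of the usual round trip; the pyarrow engine stores the MultiIndex in the file's pandas metadata, so read_parquet can restore it without an explicit set_index:

    import pandas as pd

    initial_df = pd.DataFrame([{'a': 0, 'b': 0, 'c': 10.898},
                               {'a': 0, 'b': 1, 'c': 1.88},
                               {'a': 1, 'b': 0, 'c': 108.1},
                               {'a': 1, 'b': 1, 'c': 10.898}])
    initial_df.set_index(['a', 'b'], inplace=True)

    # Write and read back; the index comes back as a MultiIndex.
    initial_df.to_parquet("multi_index.parquet", engine="pyarrow")
    restored = pd.read_parquet("multi_index.parquet", engine="pyarrow")
    print(restored.index.names)  # ['a', 'b']
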
1 vote • 1 answer

Can a parquet file exceed 2.1GB?

I'm having an issue storing a large dataset (around 40GB) in a single parquet file. I'm using the fastparquet library to append pandas.DataFrames to this parquet dataset file. The following is a minimal example program that appends chunks to a…
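
One way around very large single files, sketched with fastparquet: append chunks to a "hive"-style dataset (a directory of part files) instead of one ever-growing file. The chunk generator below is a made-up stand-in for the question's real data source:

    import pandas as pd
    from fastparquet import write

    def chunks():
        # Hypothetical stand-in for the real ~40 GB data source.
        for i in range(10):
            yield pd.DataFrame({"x": range(i * 1_000, (i + 1) * 1_000)})

    path = "big_dataset.parq"
    for i, chunk in enumerate(chunks()):
        # The first call creates the dataset; later calls add new row groups.
        write(path, chunk, file_scheme="hive", append=(i > 0))
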
1 vote • 1 answer

PyArrow: How to batch data from mongo into partitioned parquet in S3

I want to be able to archive my data from Mongo into S3. Currently, what I do is: (1) read data from Mongo, (2) convert this into a pyarrow Table, (3) write to S3. It works for now, but steps 1 and 2 are a bulk operation where, if the result set is huge, it…
Jiew Meng • 84,767 • 185 • 495 • 805
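
A rough sketch of batching the Mongo cursor instead of materialising everything at once, writing each batch into a partitioned dataset on S3 with pyarrow; the connection details, collection, partition columns, and bucket are all made up, and the documents are assumed to share a schema:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder
    collection = client["mydb"]["mycollection"]                # placeholder

    BATCH_SIZE = 50_000

    def flush(rows):
        # Convert one batch of documents into an Arrow table and append it
        # to the partitioned dataset; s3:// paths are handled by pyarrow's
        # S3 filesystem support.
        table = pa.Table.from_pylist(rows)
        pq.write_to_dataset(table, root_path="s3://my-bucket/archive",
                            partition_cols=["year", "month"])

    batch = []
    for doc in collection.find({}, batch_size=BATCH_SIZE):
        doc.pop("_id", None)  # ObjectId has no direct Arrow equivalent
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)
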
1 vote • 0 answers

How to get a distinct count from a Parquet file without reading the whole file using Java ParquetFileReader

I need to get information such as the maximum length of a string column's values and the distinct count from a Parquet file without reading the whole file, in Java, and without using the Avro Parquet reader. ParquetFileReader parquetFileReader =…
maya • 21 • 1
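
For comparison, this is what the footer statistics expose when read with pyarrow (the Java ParquetFileReader surfaces the same row-group fields): min/max and null count are usually present, while a distinct count is optional and rarely written by producers. Path and column index are placeholders:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("data.parquet").metadata  # placeholder path

    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(0).statistics
        if stats is not None and stats.has_min_max:
            print(rg, stats.min, stats.max,
                  stats.null_count, stats.distinct_count)
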
1 vote • 0 answers

How to null out values in a Parquet file in Scala without Spark

I am looking for a way to read a parquet file, replace values with null in some columns where a condition matches, and write the data back to the original file. Using Spark it's pretty easy, but I want to achieve this without it. This is how I would do…
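
The question asks for Scala, but the general pattern (read, build a mask, swap in a nulled column, rewrite the file) is the same in any library; a pyarrow sketch with made-up column names and condition:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")  # placeholder path
    value_idx = table.schema.get_field_index("value")

    # Rows whose 'value' should become null (condition is hypothetical).
    mask = pc.equal(table["status"], "obsolete")

    # Replace matching rows with typed nulls and swap the column back in.
    nulls = pa.nulls(len(table), type=table["value"].type)
    patched = table.set_column(value_idx, "value",
                               pc.if_else(mask, nulls, table["value"]))

    # Parquet files are immutable, so "writing back" means rewriting the file.
    pq.write_table(patched, "data.parquet")
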
1 vote • 2 answers

Trying to filter in dask.read_parquet tries to compare NoneType and str

I have a project where I pass the following load_args to read_parquet: filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]} According to the documentation, a List[Tuple] like this should be accepted and I should get all…
filpa • 3,651 • 8 • 52 • 91
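
For reference, the filters argument is disjunctive normal form: an outer list of OR groups, each an inner list of AND tuples. A minimal sketch with a placeholder path:

    import dask.dataframe as dd

    filters = [[("itemId", "==", "9403cfde-7fe5-4c9c-916c-41ff0b595c5c")]]

    ddf = dd.read_parquet("s3://my-bucket/data/",  # placeholder path
                          filters=filters)
    print(ddf.head())
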
1 vote • 0 answers

AWS Glue Spark job: Found duplicate column(s) in the data schema and the partition schema: `day`, `month`, `year`

The Glue Spark job is failing with the error message AnalysisException: Found duplicate column(s) in the data schema and the partition schema: day, month, year. My actual Parquet data files in S3 include these partition columns as well. Code…
Raaj • 29 • 2 • 6
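
A sketch of how this layout usually comes about and one way to avoid it: if partitionBy owns year/month/day, Spark stores them only in the directory names and removes them from the file contents, so the data schema and the partition schema no longer overlap (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedupe-partition-columns").getOrCreate()

    # Raw files that still contain year/month/day as ordinary columns.
    df = spark.read.parquet("s3://my-bucket/raw/")

    # partitionBy moves those columns into the directory structure and
    # drops them from the data files themselves.
    (df.write
       .mode("overwrite")
       .partitionBy("year", "month", "day")
       .parquet("s3://my-bucket/clean/"))
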
1 vote • 0 answers

Setting up Bloom filter with PyArrow

I'm writing some datasets to parquet using pyarrow.parquet.write_to_dataset(). Now I'm trying to enable the bloom filter when writing (located in the metadata), but I can find no way to do this. I know in Spark you can do something…
sancholp • 67 • 7
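
The Spark route the question alludes to looks roughly like this: parquet-mr accepts per-column bloom filter settings as writer options. The column name 'key', the ndv value, and the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bloom-filter-write").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/input/")  # placeholder path

    (df.write
       .option("parquet.bloom.filter.enabled#key", "true")
       .option("parquet.bloom.filter.expected.ndv#key", "1000000")
       .parquet("s3://my-bucket/output/"))
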
1 vote • 1 answer

Using Parquet metadata to find a specific key

I have a bunch of Parquet files containing data where each row has the form [key, data1, data2, data3,...]. I need to know in which file a certain key is located, without actually opening each file and searching. Is it possible to get this from the…
sancholp • 67 • 7
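
Row-group statistics in each file's footer can at least narrow the search: min/max bounds on the key column rule files out without reading their data, though a match still has to be confirmed by actually reading. A pyarrow sketch assuming the key is the first column and the files match a glob pattern:

    import glob
    import pyarrow.parquet as pq

    def candidate_files(key, pattern="data/*.parquet"):
        for path in glob.glob(pattern):
            meta = pq.ParquetFile(path).metadata
            for rg in range(meta.num_row_groups):
                stats = meta.row_group(rg).column(0).statistics
                if stats is not None and stats.has_min_max \
                        and stats.min <= key <= stats.max:
                    yield path
                    break

    print(list(candidate_files("some-key")))
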
1 vote • 0 answers

AWS Sagemaker batch transform job with large parquet file and split_type

I'm trying to run a sagemaker batch transform job on a large parquet file (2GB) and I keep having issues with it. In my transformer, I have had to specify split_type='Line' so that I don't get the following error, even when using max_payload=100 Too…
1 vote • 0 answers

Is it possible to write a Parquet file using an OutputStream?

I have an OutputStream and I want to create a Parquet file using this OutputStream. Is it possible to do that?
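
In parquet-mr the writer is built around its own OutputFile/PositionOutputStream abstractions rather than a bare java.io.OutputStream; for comparison, the analogous idea in Python is writing to a file-like object instead of a path:

    import io
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})

    # write_table accepts any binary file-like object, so the Parquet bytes
    # can be produced in memory and then pushed to whatever sink is needed.
    buf = io.BytesIO()
    pq.write_table(table, buf)
    print(len(buf.getvalue()), "bytes of Parquet written to the stream")
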