Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
1 vote · 0 answers

AvroParquetWriter - addLogicalTypeConversion not working as expected (using version parquet-avro 1.12.3) - causing ClassCastException

I am writing a ResultSet to a parquet file using AvroParquetWriter. One column in the ResultSet is a java.sql.Timestamp. When writing, I get the exception: java.sql.Timestamp cannot be cast to java.lang.Number. Adding addLogicalTypeConversion does not…
1 vote · 0 answers

How to concat few parquet files with same schema?

I have some parquet files, let's say 10, with the same schema, and I want to merge them into one parquet file, because I need a single file to process in Delta Lake faster. I found some options here on Stack using Hive, but I don't use…
martin · 1,145
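
A minimal way to do this without Hive is to read the same-schema files as one pyarrow dataset and write a single file back out; the paths below are illustrative, and the merged table is held fully in memory:

```python
import pyarrow.parquet as pq

# Read all the same-schema files in one directory as a single dataset.
dataset = pq.ParquetDataset("input_parquets/")
table = dataset.read()

# Write the merged rows back out as one parquet file.
pq.write_table(table, "merged.parquet")
```

For inputs too large for memory, the same result can be built incrementally by copying each file through a single pq.ParquetWriter instead.
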
1 vote · 1 answer

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error

As part of an analysis pipeline I am using around 60,000 parquet files, each containing one line of data, which must be concatenated. Each file can contain a different set of columns and I need to unify them before concatenating them with Ray…
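
Recent pyarrow releases expose the Thrift deserialization caps as keyword arguments, which is one way around this error; the file name and the 1 GB limits below are arbitrary illustrative values:

```python
import pyarrow.parquet as pq

# Raise the Thrift size caps that the default settings trip on for
# very wide or metadata-heavy footers.
pf = pq.ParquetFile(
    "one_row.parquet",
    thrift_string_size_limit=1_000_000_000,
    thrift_container_size_limit=1_000_000_000,
)
table = pf.read()
```
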
1 vote · 0 answers

Efficient row-reading of a parquet file in Java

I am writing a program in Java that consumes parquet files and processes them line by line. Each file is rather large: roughly 1.3 million rows and 3,000 columns of double-precision floats, for a file size of about 6.6 GB. I have tried implementing the…
Harry Braviner · 627
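
The question asks about Java, but the underlying pattern is language-agnostic: read the columnar file a batch of rows at a time instead of materializing everything. A sketch of that pattern with pyarrow; the file name, batch size, and process() are placeholders:

```python
import pyarrow.parquet as pq

def process(row: dict) -> None:
    """Placeholder for the per-row processing."""

pf = pq.ParquetFile("big_doubles.parquet")

# Stream a few thousand rows at a time instead of loading all
# ~1.3M rows x 3000 double columns at once.
for batch in pf.iter_batches(batch_size=4096):
    for row in batch.to_pylist():  # each row as a dict keyed by column
        process(row)
```
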
1 vote · 2 answers

Streaming and caching tabular data with fsspec, parquet and Pyarrow

I’m trying to stream data from parquet files stored in Dropbox (but it could be somewhere else: S3, gdrive, etc.) and read it into Pandas while caching it. For that I’m trying to use fsspec for Python. Following these instructions, this is what I’m…
Luiz Tauffer · 463
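
One common way to wire this up: chain fsspec's filecache protocol in front of the remote URL so repeated reads come from local disk. The URL and cache directory below are placeholders, not taken from the question:

```python
import fsspec
import pandas as pd

# "filecache::" keeps a local copy of the remote file between reads.
with fsspec.open(
    "filecache::https://example.com/data.parquet",
    filecache={"cache_storage": "/tmp/fsspec_cache"},
) as f:
    df = pd.read_parquet(f)
```
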
1 vote · 1 answer

Is it possible to have one meta file for multiple parquet data files?

I have a process that generates millions of small dataframes and saves them to parquet in parallel. All dataframes have the same columns and index information, and the same number of rows (about 300). As the dataframes are small, when they are…
Lei Yu · 199
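
Parquet supports a "_metadata" sidecar that aggregates the footers of many data files, and pyarrow can write one. A sketch with a toy schema and illustrative paths (API details vary a little across pyarrow versions):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.float64())])
collector = []

# Write each small table into the dataset directory, collecting its
# footer metadata as we go.
for i in range(3):
    table = pa.table({"value": [float(i)] * 300}, schema=schema)
    pq.write_to_dataset(table, "dataset_root",
                        metadata_collector=collector)

# One _metadata file describing the row groups of every data file.
pq.write_metadata(schema, "dataset_root/_metadata",
                  metadata_collector=collector)
```
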
1 vote · 0 answers

Error in `.rowNamesDF<-`(x, value = value): invalid 'row.names' length when assigning rowname

My data frame is derived from a parquet file and read with the arrow package. Parquet automatically reassigns the index as the last column, and now I want to re-index the data frame. How do I assign the last column as the row names? meth_gene <-…
melolilili · 239
1 vote · 2 answers

Is there a tool to query Parquet files which are hosted in S3 storage?

I have Parquet files in an S3 bucket which is not AWS S3. Is there a tool that connects to any S3-compatible service (like Wasabi, DigitalOcean, or MinIO) and allows me to query the Parquet files?
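
One tool that fits: DuckDB's httpfs extension speaks the S3 protocol against any compatible endpoint. A sketch from Python; the endpoint, credentials, and bucket are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Point the S3 layer at a non-AWS endpoint (placeholder values).
con.execute("SET s3_endpoint='s3.wasabisys.com';")
con.execute("SET s3_access_key_id='KEY';")
con.execute("SET s3_secret_access_key='SECRET';")

# Query the parquet file in place with SQL.
print(con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/data.parquet')"
).fetchall())
```
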
1 vote · 1 answer

Cannot save PySpark dataframe as a parquet file

Complete Python newbie here. I am trying to save a PySpark dataframe as a parquet file, but it's giving me an error. I installed PySpark (version 3.3.0, Hadoop 3.2.2, Java jdk1.8.0_351) on my PC and created the environment variables, as…
Jae · 21
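
For reference, the minimal working write looks like the sketch below; when this exact code fails on Windows with that setup, the usual culprit is the HADOOP_HOME/winutils.exe pairing rather than the Python code itself. Paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Spark writes a directory of part files under this path.
df.write.mode("overwrite").parquet("out/test.parquet")
```
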
1 vote · 0 answers

How do I interpret INT96 represented as a byte[12] array in Java, and convert to DateTime?

I'm reading from a parquet file and I noticed per the schema that our dates are being read as INT96 represented as byte[12]. So when reading from the parquet file, a date will look like an Object [0, 0, 0, 0, 0, 0, 0, 0, -63, -120, 37, 0]. Does…
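
INT96 timestamps pack eight little-endian bytes of nanoseconds-of-day followed by four little-endian bytes of Julian day number. The arithmetic below is shown in Python but ports directly to Java; the byte values come from the question:

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_datetime(raw: bytes) -> datetime:
    # 8 bytes nanos-of-day, then 4 bytes Julian day, both little-endian.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=julian_day - JULIAN_EPOCH,
                        microseconds=nanos_of_day // 1000))

# Java bytes are signed; mask to recover the unsigned values.
raw = bytes(b & 0xFF for b in
            [0, 0, 0, 0, 0, 0, 0, 0, -63, -120, 37, 0])
print(int96_to_datetime(raw))
```
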
1 vote · 1 answer

Writing Dataframe to a parquet file but no headers are being written

I have the following code:
print(df.show(3))
print(df.columns)
df.select('port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g').write.format("parquet").save("qwe.parquet")
For some reason this doesn't write the Dataframe into…
qwerty · 887
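
Worth noting when debugging this: Parquet has no CSV-style header row; column names live in each file's footer, and Spark's save() produces a directory of part files rather than a single file. Reading the output back (sketch below, reusing the path from the question) confirms whether the schema made it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "qwe.parquet" is a directory of part files; the column names are
# stored in each part's footer, not as a header row.
df_back = spark.read.parquet("qwe.parquet")
df_back.printSchema()
print(df_back.columns)
```
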
1 vote · 1 answer

Pyspark - timestamp col being Null while reading from Parquet Value

I am reading a CSV file and writing it into a parquet file partitioned by a column. After reading from the CSV file, this is what I am getting:
>>> df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col5: double (nullable = true)
 |-- col6:…
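
A common cause here is the CSV reader failing to parse the timestamp strings, leaving nulls that then propagate into the parquet output. A hedged sketch of parsing the column explicitly before writing; the column names and format string are assumptions, not from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", True).csv("input.csv")

# Parse the timestamp column explicitly; adjust the pattern to match
# the actual strings in the CSV ("ts" and the format are placeholders).
df = df.withColumn("ts", F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss"))
df.write.partitionBy("col1").parquet("out/")
```
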
1 vote · 0 answers

How to read parquet file from s3 on AWS lambda with Java?

private final S3Client s3Client = S3Client.builder().build();

@Override
public Void handleRequest(SQSEvent event, Context context) {
    for (SQSMessage msg : event.getRecords()) {
        S3Event s3Event =…
Ftwpker · 11
1 vote · 0 answers

Spark does not seem to write multiple PARQUET files to HDFS concurrently, even though I have multiple cores and parallelism is large?

I have a PySpark job that writes about 1.5 TB of data to HDFS in PARQUET format. Here are the Spark params:
Num of executors: 500
Driver memory: 16G
Driver cores: 4
Executor memory: 16G
Executor cores:…
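
The number of files written concurrently equals the task count of the final stage, not the executor count, so a common fix is repartitioning just before the write. A sketch; the paths and the partition count of 2000 are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/in")  # placeholder input

# More partitions in the final stage means more part files written at
# once (up to the number of available executor cores).
df.repartition(2000).write.mode("overwrite").parquet("hdfs:///data/out")
```
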
1 vote · 0 answers

Is there a way to read a parquet file from an InputStream in Java?

I am trying to read a parquet record from S3. S3 usually returns an input stream, which I want to retrieve the data from. I am using Java, and I don't want to use Spark's built-in reader. Is there a way to do this?
Anas Ahmed · 11
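
The question is Java-specific, but the constraint is general: the Parquet footer sits at the end of the file, so a forward-only stream has to be fully buffered (or the object fetched with range reads) before it can be parsed. The analogous in-memory pattern, sketched in Python with pyarrow and boto3; the bucket and key are placeholders:

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# Buffer the whole object, then read it as an in-memory file.
body = boto3.client("s3").get_object(
    Bucket="my-bucket", Key="data.parquet")["Body"].read()
table = pq.read_table(pa.BufferReader(body))
```
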