Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
1 vote · 0 answers

AvroParquetWriter - addLogicalTypeConversion not working as expected (using version parquet-avro 1.12.3) - causing ClassCastException

I am writing a ResultSet to a parquet file using AvroParquetWriter. One column in the ResultSet is a java.sql.Timestamp. When writing, I get the exception: java.sql.Timestamp cannot be cast to java.lang.Number. Adding addLogicalTypeConversion does not…
1 vote · 0 answers

How to concat few parquet files with same schema?

I have some parquet files, let's say 10, with the same schema, and I want to merge them into one parquet file, because I need a single file to process in Delta Lake faster. I found some options here on Stack using Hive, but I don't use…
martin · 1,145
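
A minimal way to do this without Hive is to read the same-schema files as one pyarrow dataset and write a single file back out; the paths below are illustrative, and the merged table is held fully in memory:

```python
import pyarrow.parquet as pq

# Read all the same-schema files in one directory as a single dataset.
dataset = pq.ParquetDataset("input_parquets/")
table = dataset.read()

# Write the merged rows back out as one parquet file.
pq.write_table(table, "merged.parquet")
```

For inputs too large for memory, the same result can be built incrementally by copying each file through a single pq.ParquetWriter instead.
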
1 vote · 1 answer

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error

As part of an analysis pipeline I am using around 60,000 parquet files, each containing one line of data, which must be concatenated. Each file can contain a different set of columns and I need to unify them before concatenating them with Ray…
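
Recent pyarrow releases expose the Thrift deserialization caps as keyword arguments, which is one way around this error; the file name and the 1 GB limits below are arbitrary illustrative values:

```python
import pyarrow.parquet as pq

# Raise the Thrift size caps that the default settings trip on for
# very wide or metadata-heavy footers.
pf = pq.ParquetFile(
    "one_row.parquet",
    thrift_string_size_limit=1_000_000_000,
    thrift_container_size_limit=1_000_000_000,
)
table = pf.read()
```
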
1 vote · 0 answers

Efficient row-reading of a parquet file in Java

I am writing a program in Java that consumes parquet files and processes them line by line. Each file is rather large: roughly 1.3 million rows and 3,000 columns of double-precision floats, for a file size of about 6.6 GB. I have tried implementing the…
Harry Braviner · 627
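
The question asks about Java, but the underlying pattern is language-agnostic: read the columnar file a batch of rows at a time instead of materializing everything. A sketch of that pattern with pyarrow; the file name, batch size, and process() are placeholders:

```python
import pyarrow.parquet as pq

def process(row: dict) -> None:
    """Placeholder for the per-row processing."""

pf = pq.ParquetFile("big_doubles.parquet")

# Stream a few thousand rows at a time instead of loading all
# ~1.3M rows x 3000 double columns at once.
for batch in pf.iter_batches(batch_size=4096):
    for row in batch.to_pylist():  # each row as a dict keyed by column
        process(row)
```
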
1 vote · 2 answers

Streaming and caching tabular data with fsspec, parquet and Pyarrow

I’m trying to stream data from parquet files stored in Dropbox (but it could be somewhere else: S3, gdrive, etc.) and read it into Pandas while caching it. For that I’m trying to use fsspec for Python. Following these instructions, this is what I’m…
Luiz Tauffer · 463
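
One common way to wire this up: chain fsspec's filecache protocol in front of the remote URL so repeated reads come from local disk. The URL and cache directory below are placeholders, not taken from the question:

```python
import fsspec
import pandas as pd

# "filecache::" keeps a local copy of the remote file between reads.
with fsspec.open(
    "filecache::https://example.com/data.parquet",
    filecache={"cache_storage": "/tmp/fsspec_cache"},
) as f:
    df = pd.read_parquet(f)
```
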
1 vote · 1 answer

Is it possible to have one meta file for multiple parquet data files?

I have a process that generates millions of small dataframes and saves them to parquet in parallel. All dataframes have the same columns and index information, and the same number of rows (about 300). As the dataframes are small, when they are…
Lei Yu · 199
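
Parquet supports a "_metadata" sidecar that aggregates the footers of many data files, and pyarrow can write one. A sketch with a toy schema and illustrative paths (API details vary a little across pyarrow versions):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("value", pa.float64())])
collector = []

# Write each small table into the dataset directory, collecting its
# footer metadata as we go.
for i in range(3):
    table = pa.table({"value": [float(i)] * 300}, schema=schema)
    pq.write_to_dataset(table, "dataset_root",
                        metadata_collector=collector)

# One _metadata file describing the row groups of every data file.
pq.write_metadata(schema, "dataset_root/_metadata",
                  metadata_collector=collector)
```
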
1 vote · 0 answers

Error in `.rowNamesDF<-`(x, value = value): invalid 'row.names' length when assigning rowname

My data frame is derived from a parquet file and read with the arrow package. Parquet automatically reassigns the index as the last column, and now I want to re-index the data frame. How do I assign the last column as the row names? meth_gene <-…
melolilili · 239
1 vote · 2 answers

Is there a tool to query Parquet files which are hosted in S3 storage?

I have Parquet files in an S3 bucket which is not AWS S3. Is there a tool that connects to any S3-compatible service (like Wasabi, DigitalOcean, or MinIO) and allows me to query the Parquet files?
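
One tool that fits: DuckDB's httpfs extension speaks the S3 protocol against any compatible endpoint. A sketch from Python; the endpoint, credentials, and bucket are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Point the S3 layer at a non-AWS endpoint (placeholder values).
con.execute("SET s3_endpoint='s3.wasabisys.com';")
con.execute("SET s3_access_key_id='KEY';")
con.execute("SET s3_secret_access_key='SECRET';")

# Query the parquet file in place with SQL.
print(con.execute(
    "SELECT count(*) FROM read_parquet('s3://my-bucket/data.parquet')"
).fetchall())
```
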
1 vote · 1 answer

Cannot save PySpark dataframe as a parquet file

Complete Python newbie here. I am trying to save a PySpark dataframe as a parquet file, but it's giving me an error. I installed PySpark (version 3.3.0, Hadoop 3.2.2, Java jdk1.8.0_351) on my PC and created the environment variables, as…
Jae · 21
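
For reference, the minimal working write looks like the sketch below; when this exact code fails on Windows with that setup, the usual culprit is the HADOOP_HOME/winutils.exe pairing rather than the Python code itself. Paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Spark writes a directory of part files under this path.
df.write.mode("overwrite").parquet("out/test.parquet")
```
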
1 vote · 0 answers

How do I interpret INT96 represented as a byte[12] array in Java, and convert to DateTime?

I'm reading from a parquet file and I noticed per the schema that our dates are being read as INT96 represented as byte[12]. So when reading from the parquet file, a date will look like an Object [0, 0, 0, 0, 0, 0, 0, 0, -63, -120, 37, 0]. Does…
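
INT96 timestamps pack eight little-endian bytes of nanoseconds-of-day followed by four little-endian bytes of Julian day number. The arithmetic below is shown in Python but ports directly to Java; the byte values come from the question:

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_datetime(raw: bytes) -> datetime:
    # 8 bytes nanos-of-day, then 4 bytes Julian day, both little-endian.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=julian_day - JULIAN_EPOCH,
                        microseconds=nanos_of_day // 1000))

# Java bytes are signed; mask to recover the unsigned values.
raw = bytes(b & 0xFF for b in
            [0, 0, 0, 0, 0, 0, 0, 0, -63, -120, 37, 0])
print(int96_to_datetime(raw))
```
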
1 vote · 1 answer

Writing Dataframe to a parquet file but no headers are being written

I have the following code:
print(df.show(3))
print(df.columns)
df.select('port', 'key', 'return_b', 'return_a', 'return_c', 'return_d', 'return_g').write.format("parquet").save("qwe.parquet")
For some reason this doesn't write the Dataframe into…
qwerty · 887
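
Worth noting when debugging this: Parquet has no CSV-style header row; column names live in each file's footer, and Spark's save() produces a directory of part files rather than a single file. Reading the output back (sketch below, reusing the path from the question) confirms whether the schema made it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "qwe.parquet" is a directory of part files; the column names are
# stored in each part's footer, not as a header row.
df_back = spark.read.parquet("qwe.parquet")
df_back.printSchema()
print(df_back.columns)
```
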
1 vote · 1 answer

Pyspark - timestamp col being Null while reading from Parquet Value

I am reading a CSV file and writing it into a parquet file partitioned by a column. After reading from the CSV file, this is what I am getting:
>>> df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col5: double (nullable = true)
 |-- col6:…
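
A common cause here is the CSV reader failing to parse the timestamp strings, leaving nulls that then propagate into the parquet output. A hedged sketch of parsing the column explicitly before writing; the column names and format string are assumptions, not from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", True).csv("input.csv")

# Parse the timestamp column explicitly; adjust the pattern to match
# the actual strings in the CSV ("ts" and the format are placeholders).
df = df.withColumn("ts", F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss"))
df.write.partitionBy("col1").parquet("out/")
```
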
1 vote · 0 answers

How to read parquet file from s3 on AWS lambda with Java?

private final S3Client s3Client = S3Client.builder().build();

@Override
public Void handleRequest(SQSEvent event, Context context) {
    for (SQSMessage msg : event.getRecords()) {
        S3Event s3Event =…
Ftwpker · 11
1 vote · 0 answers

Spark does not seem to write multiple PARQUET files to HDFS concurrently, even though I have multiple cores and parallelism is large?

I have a PySpark job that writes about 1.5 TB of data to HDFS in PARQUET format. Here are the Spark params:
Num of executors: 500
Driver memory: 16G
Driver cores: 4
Executor memory: 16G
Executor cores:…
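
The number of files written concurrently equals the task count of the final stage, not the executor count, so a common fix is repartitioning just before the write. A sketch; the paths and the partition count of 2000 are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/in")  # placeholder input

# More partitions in the final stage means more part files written at
# once (up to the number of available executor cores).
df.repartition(2000).write.mode("overwrite").parquet("hdfs:///data/out")
```
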
1 vote · 0 answers

Is there a way to read a parquet file from an InputStream in Java?

I am trying to read a parquet record from S3. S3 usually returns an input stream, which I want to retrieve the data from. I am using Java, and I don't want to use Spark's built-in reader. Is there a way to do this?
Anas Ahmed · 11
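
The question is Java-specific, but the constraint is general: the Parquet footer sits at the end of the file, so a forward-only stream has to be fully buffered (or the object fetched with range reads) before it can be parsed. The analogous in-memory pattern, sketched in Python with pyarrow and boto3; the bucket and key are placeholders:

```python
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# Buffer the whole object, then read it as an in-memory file.
body = boto3.client("s3").get_object(
    Bucket="my-bucket", Key="data.parquet")["Body"].read()
table = pq.read_table(pa.BufferReader(body))
```
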