Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
15
votes
2 answers

Is there a way to directly insert data from a parquet file into a PostgreSQL database?

I'm trying to restore some historic backup files that were saved in parquet format, and I want to read them once and write the data into a PostgreSQL database. I know the backup files were saved using Spark, but there is a strict restriction for me that…
Javad Bahoosh
  • 400
  • 1
  • 3
  • 16
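
A minimal sketch of one possible approach to the question above, assuming the backup is a plain parquet file readable with pyarrow and that SQLAlchemy/psycopg2 can reach the target database; the file path, connection string, and table name are illustrative.

# Read the parquet backup with pyarrow and bulk-load the rows into PostgreSQL.
import pyarrow.parquet as pq
from sqlalchemy import create_engine

df = pq.read_table("backup/part-00000.parquet").to_pandas()   # hypothetical path

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")
# to_sql issues INSERTs in chunks; for very large backups COPY is usually faster.
df.to_sql("restored_backup", engine, if_exists="append", index=False, chunksize=10_000)
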
15
votes
6 answers

Read partitioned parquet directory (all files) in one R dataframe with apache arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: the parquet files were created with a Spark pipeline and saved on S3, to be read with RStudio/RShiny with one column as an index for further analysis. The parquet file…
Alex Ortner
  • 1,097
  • 8
  • 24
15
votes
4 answers

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into…
BdEngineer
  • 2,929
  • 4
  • 49
  • 85
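
One commonly suggested mitigation for the small-file problem described above, sketched in PySpark on the assumption of a Kafka source and a parquet sink; the broker, topic, paths, and trigger interval are illustrative, not taken from the question.

# Widen the trigger interval and coalesce partitions so each micro-batch
# produces fewer, larger parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "events")                      # hypothetical topic
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .coalesce(4)                                         # fewer output files per batch
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")               # hypothetical sink path
         .option("checkpointLocation", "hdfs:///chk/events")
         .trigger(processingTime="10 minutes")                # larger batches -> larger files
         .start())
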
15
votes
2 answers

What are the compression types supported in Parquet?

I was writing data to Hadoop and Hive in parquet format using Spark. I want to enable compression, but I can only find two compression types - Snappy and Gzip - being used most of the time. Does Parquet support any other compression like Deflate and…
User_qwerty
  • 375
  • 1
  • 2
  • 10
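
For the compression question above, a short PySpark sketch of how a codec is selected at write time; snappy and gzip are the widely supported ones, while lzo, brotli, lz4, and zstd depend on the Spark/Parquet build, so treat their availability as something to verify.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)   # illustrative DataFrame

# Per-write codec choice:
df.write.option("compression", "gzip").parquet("/tmp/out_gzip")

# Or set a session-wide default for all parquet writes:
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/tmp/out_snappy")
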
15
votes
1 answer

Storing multiple dataframes of different widths with Parquet?

Does Parquet support storing various data frames of different widths (numbers of columns) in a single file? E.g. in HDF5 it is possible to store multiple such data frames and access them by key. So far, from my reading, it looks like Parquet does not…
Turo
  • 1,537
  • 2
  • 21
  • 42
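
A parquet file carries a single schema, so a common workaround for the question above is one file per data frame, addressed by file name much like an HDF5 key; a small pandas sketch under that assumption (directory and frame names are illustrative).

import os
import pandas as pd

os.makedirs("store", exist_ok=True)            # directory playing the role of the "file"

frames = {
    "narrow": pd.DataFrame({"a": [1, 2]}),
    "wide":   pd.DataFrame({"a": [1], "b": [2], "c": [3]}),
}

for key, df in frames.items():
    df.to_parquet(f"store/{key}.parquet")      # one parquet file per frame

wide = pd.read_parquet("store/wide.parquet")   # "access by key" = read the matching file
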
15
votes
2 answers

How to read a parquet file in R without using spark packages?

I could find many answers online using sparklyr or various Spark packages, which actually require spinning up a Spark cluster - an overhead. In Python I could find a way to do this using "pandas.read_parquet" or Apache Arrow in…
Gerg
  • 336
  • 4
  • 14
15
votes
2 answers

Spark + Parquet + Snappy: Overall compression ratio loses after spark shuffles data

Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, let's call it product, on HDFS which was imported using Sqoop ImportTool as-parquet-file with the Snappy codec. As a result of…
Mikhail Dubkov
  • 1,223
  • 1
  • 12
  • 16
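
A frequently suggested remedy for the compression-ratio drop described above is to re-cluster rows after the shuffle so that similar values sit next to each other again, which helps dictionary and run-length encoding; a PySpark sketch with illustrative paths and column names.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
product = spark.read.parquet("hdfs:///warehouse/product")          # hypothetical input

result = (product
          .repartition("product_category")                         # the shuffle in question
          .sortWithinPartitions("product_category", "product_id")) # restore value locality

result.write.option("compression", "snappy").parquet("hdfs:///warehouse/product_out")
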
15
votes
6 answers

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form:

├── bucket
│   ├── somedata
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├──…
chazzmoney
  • 221
  • 2
  • 9
15
votes
4 answers

Documentation for Apache's Parquet Java API?

I would like to use Apache's parquet-mr project to read/write Parquet files programmatically with Java. I can't seem to find any documentation for how to use this API (aside from going through the source code and seeing how it's used) -- just…
Jason Evans
  • 1,197
  • 1
  • 13
  • 22
15
votes
3 answers

spark parquet write gets slow as partitions grow

I have a Spark streaming application that writes parquet data from a stream.

sqlContext.sql(
  """
    |select
    |to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date,
    …
Gaurav Shah
  • 5,223
  • 7
  • 43
  • 71
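
Two settings that are often pointed at for the slowdown described above: schema merging and per-write summary metadata, both of which get more expensive as the number of partitions and files grows. Whether they apply depends on the Spark and Parquet versions in use, so this is an assumption to verify rather than a definitive fix.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.parquet.mergeSchema", "false")                  # skip schema merging on read
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")   # skip _metadata summary files
         .getOrCreate())

# ... the streaming job then writes parquet as before ...
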
15
votes
1 answer

Generate metadata for parquet files

I have a Hive table that is built on top of a load of external parquet files. The metadata for the parquet files should have been generated by the Spark job, but because the metadata flag was set to false it was not. I'm wondering if it is possible to restore it in…
TheMP
  • 8,257
  • 9
  • 44
  • 73
15
votes
4 answers

Read few parquet files at the same time in Spark

I can read a few JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for parquet? Star doesn't work.
SkyFox
  • 1,805
  • 4
  • 22
  • 33
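
For the wildcard question above, a PySpark sketch of the two usual options; in recent Spark versions glob patterns do work for parquet, while older releases needed the paths listed explicitly (paths here are illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_glob  = spark.read.parquet("/path/to/dir/*.parquet")     # glob pattern
df_multi = spark.read.parquet("/path/to/dir/a.parquet",
                              "/path/to/dir/b.parquet")     # explicit list of paths
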
15
votes
3 answers

EntityTooLarge error when uploading a 5G file to Amazon S3

The Amazon S3 file size limit is supposed to be 5 TB according to this announcement, but I am getting the following error when uploading a 5G…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
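
The error above usually means a single PUT was attempted, which S3 caps at 5 GB; the 5 TB limit only applies to multipart uploads. A boto3 sketch of a managed transfer that switches to multipart automatically (bucket, key, and threshold are illustrative).

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(multipart_threshold=64 * 1024 * 1024)   # go multipart above 64 MB

s3.upload_file("big.parquet", "my-bucket", "backups/big.parquet", Config=config)
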
14
votes
5 answers

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow',…
Austin
  • 6,921
  • 12
  • 73
  • 138
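
DataFrame.to_parquet itself writes a single file, so one route for the question above is to hand the data to pyarrow's dataset writer, which can split output by row count; max_rows_per_file needs a reasonably recent pyarrow, so treat the parameter as an assumption to check.

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

df = pd.DataFrame({"x": range(1_000_000)})        # illustrative stand-in for the 100M x 100 frame
table = pa.Table.from_pandas(df)

ds.write_dataset(table, "out_dir", format="parquet",
                 max_rows_per_file=100_000,       # several smaller files instead of one
                 max_rows_per_group=100_000)      # must not exceed max_rows_per_file
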
14
votes
2 answers

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use:

import pandas as pd
import pyarrow as pa

Here's a minimal example to show the situation: df =…
Silver Duck
  • 581
  • 1
  • 5
  • 18
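
Parquet has no notion of a bson ObjectId, so a common workaround for the question above is to cast the column to str (or bytes) before writing and rebuild the objects after reading; the ObjectId class below comes from the bson package, and the column names are illustrative.

import pandas as pd
from bson import ObjectId

df = pd.DataFrame({"_id": [ObjectId(), ObjectId()], "value": [1, 2]})

# Store the ObjectIds as their 24-character hex strings.
df.assign(_id=df["_id"].astype(str)).to_parquet("docs.parquet", engine="pyarrow")

restored = pd.read_parquet("docs.parquet")
restored["_id"] = restored["_id"].map(ObjectId)   # back to ObjectId instances
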