Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
15
votes
2 answers

Is there a way to directly insert data from a parquet file into a PostgreSQL database?

I'm trying to restore some historic backup files that were saved in parquet format, and I want to read them once and write the data into a PostgreSQL database. I know the backup files were saved using Spark, but there is a strict restriction for me that…
Javad Bahoosh
  • 400
  • 1
  • 3
  • 16
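
A minimal sketch of one possible approach to the question above, assuming the backup is a plain parquet file readable with pyarrow and that SQLAlchemy/psycopg2 can reach the target database; the file path, connection string, and table name are illustrative.

# Read the parquet backup with pyarrow and bulk-load the rows into PostgreSQL.
import pyarrow.parquet as pq
from sqlalchemy import create_engine

df = pq.read_table("backup/part-00000.parquet").to_pandas()   # hypothetical path

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")
# to_sql issues INSERTs in chunks; for very large backups COPY is usually faster.
df.to_sql("restored_backup", engine, if_exists="append", index=False, chunksize=10_000)
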
15
votes
6 answers

Read partitioned parquet directory (all files) in one R dataframe with apache arrow

How do I read a partitioned parquet file into R with arrow (without any Spark)? The situation: the parquet files were created with a Spark pipeline and saved on S3, to be read with RStudio/RShiny with one column as an index for further analysis. The parquet file…
Alex Ortner
  • 1,097
  • 8
  • 24
15
votes
4 answers

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into…
BdEngineer
  • 2,929
  • 4
  • 49
  • 85
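
One commonly suggested mitigation for the small-file problem described above, sketched in PySpark on the assumption of a Kafka source and a parquet sink; the broker, topic, paths, and trigger interval are illustrative, not taken from the question.

# Widen the trigger interval and coalesce partitions so each micro-batch
# produces fewer, larger parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "events")                      # hypothetical topic
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .coalesce(4)                                         # fewer output files per batch
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")               # hypothetical sink path
         .option("checkpointLocation", "hdfs:///chk/events")
         .trigger(processingTime="10 minutes")                # larger batches -> larger files
         .start())
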
15
votes
2 answers

What are the compression types supported in Parquet?

I was writing data to Hadoop and Hive in parquet format using Spark. I want to enable compression, but I can only find two compression types - Snappy and Gzip - being used most of the time. Does Parquet support any other compression like Deflate and…
User_qwerty
  • 375
  • 1
  • 2
  • 10
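
For the compression question above, a short PySpark sketch of how a codec is selected at write time; snappy and gzip are the widely supported ones, while lzo, brotli, lz4, and zstd depend on the Spark/Parquet build, so treat their availability as something to verify.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)   # illustrative DataFrame

# Per-write codec choice:
df.write.option("compression", "gzip").parquet("/tmp/out_gzip")

# Or set a session-wide default for all parquet writes:
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.parquet("/tmp/out_snappy")
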
15
votes
1 answer

Storing multiple dataframes of different widths with Parquet?

Does Parquet support storing various data frames of different widths (numbers of columns) in a single file? E.g. in HDF5 it is possible to store multiple such data frames and access them by key. So far, from my reading, it looks like Parquet does not…
Turo
  • 1,537
  • 2
  • 21
  • 42
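
A parquet file carries a single schema, so a common workaround for the question above is one file per data frame, addressed by file name much like an HDF5 key; a small pandas sketch under that assumption (directory and frame names are illustrative).

import os
import pandas as pd

os.makedirs("store", exist_ok=True)            # directory playing the role of the "file"

frames = {
    "narrow": pd.DataFrame({"a": [1, 2]}),
    "wide":   pd.DataFrame({"a": [1], "b": [2], "c": [3]}),
}

for key, df in frames.items():
    df.to_parquet(f"store/{key}.parquet")      # one parquet file per frame

wide = pd.read_parquet("store/wide.parquet")   # "access by key" = read the matching file
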
15
votes
2 answers

How to read a parquet file in R without using spark packages?

I could find many answers online using sparklyr or various Spark packages, which actually require spinning up a Spark cluster - an overhead. In Python I could find a way to do this using "pandas.read_parquet" or Apache Arrow in…
Gerg
  • 336
  • 4
  • 14
15
votes
2 answers

Spark + Parquet + Snappy: Overall compression ratio loses after spark shuffles data

Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, let's call it product, on HDFS which was imported using Sqoop ImportTool as-parquet-file with the Snappy codec. As a result of…
Mikhail Dubkov
  • 1,223
  • 1
  • 12
  • 16
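
A frequently suggested remedy for the compression-ratio drop described above is to re-cluster rows after the shuffle so that similar values sit next to each other again, which helps dictionary and run-length encoding; a PySpark sketch with illustrative paths and column names.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
product = spark.read.parquet("hdfs:///warehouse/product")          # hypothetical input

result = (product
          .repartition("product_category")                         # the shuffle in question
          .sortWithinPartitions("product_category", "product_id")) # restore value locality

result.write.option("compression", "snappy").parquet("hdfs:///warehouse/product_out")
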
15
votes
6 answers

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form:

├── bucket
│   ├── somedata
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├──…
chazzmoney
  • 221
  • 2
  • 9
15
votes
4 answers

Documentation for Apache's Parquet Java API?

I would like to use Apache's parquet-mr project to read/write Parquet files programmatically with Java. I can't seem to find any documentation for how to use this API (aside from going through the source code and seeing how it's used) -- just…
Jason Evans
  • 1,197
  • 1
  • 13
  • 22
15
votes
3 answers

spark parquet write gets slow as partitions grow

I have a Spark streaming application that writes parquet data from a stream.

sqlContext.sql(
  """
    |select
    |to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date,
    …
Gaurav Shah
  • 5,223
  • 7
  • 43
  • 71
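
Two settings that are often pointed at for the slowdown described above: schema merging and per-write summary metadata, both of which get more expensive as the number of partitions and files grows. Whether they apply depends on the Spark and Parquet versions in use, so this is an assumption to verify rather than a definitive fix.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.parquet.mergeSchema", "false")                  # skip schema merging on read
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")   # skip _metadata summary files
         .getOrCreate())

# ... the streaming job then writes parquet as before ...
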
15
votes
1 answer

Generate metadata for parquet files

I have a Hive table that is built on top of a load of external parquet files. The metadata for the parquet files should have been generated by the Spark job, but because the metadata flag was set to false it was not. I'm wondering if it is possible to restore it in…
TheMP
  • 8,257
  • 9
  • 44
  • 73
15
votes
4 answers

Read few parquet files at the same time in Spark

I can read a few JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for parquet? Star doesn't work.
SkyFox
  • 1,805
  • 4
  • 22
  • 33
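
For the wildcard question above, a PySpark sketch of the two usual options; in recent Spark versions glob patterns do work for parquet, while older releases needed the paths listed explicitly (paths here are illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_glob  = spark.read.parquet("/path/to/dir/*.parquet")     # glob pattern
df_multi = spark.read.parquet("/path/to/dir/a.parquet",
                              "/path/to/dir/b.parquet")     # explicit list of paths
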
15
votes
3 answers

EntityTooLarge error when uploading a 5G file to Amazon S3

The Amazon S3 file size limit is supposed to be 5 TB according to this announcement, but I am getting the following error when uploading a 5G…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
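
The error above usually means a single PUT was attempted, which S3 caps at 5 GB; the 5 TB limit only applies to multipart uploads. A boto3 sketch of a managed transfer that switches to multipart automatically (bucket, key, and threshold are illustrative).

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(multipart_threshold=64 * 1024 * 1024)   # go multipart above 64 MB

s3.upload_file("big.parquet", "my-bucket", "backups/big.parquet", Config=config)
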
14
votes
5 answers

pandas df.to_parquet write to multiple smaller files

Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size? I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow',…
Austin
  • 6,921
  • 12
  • 73
  • 138
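
DataFrame.to_parquet itself writes a single file, so one route for the question above is to hand the data to pyarrow's dataset writer, which can split output by row count; max_rows_per_file needs a reasonably recent pyarrow, so treat the parameter as an assumption to check.

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

df = pd.DataFrame({"x": range(1_000_000)})        # illustrative stand-in for the 100M x 100 frame
table = pa.Table.from_pandas(df)

ds.write_dataset(table, "out_dir", format="parquet",
                 max_rows_per_file=100_000,       # several smaller files instead of one
                 max_rows_per_group=100_000)      # must not exceed max_rows_per_file
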
14
votes
2 answers

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds). Throughout the examples we use:

import pandas as pd
import pyarrow as pa

Here's a minimal example to show the situation: df =…
Silver Duck
  • 581
  • 1
  • 5
  • 18
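
Parquet has no notion of a bson ObjectId, so a common workaround for the question above is to cast the column to str (or bytes) before writing and rebuild the objects after reading; the ObjectId class below comes from the bson package, and the column names are illustrative.

import pandas as pd
from bson import ObjectId

df = pd.DataFrame({"_id": [ObjectId(), ObjectId()], "value": [1, 2]})

# Store the ObjectIds as their 24-character hex strings.
df.assign(_id=df["_id"].astype(str)).to_parquet("docs.parquet", engine="pyarrow")

restored = pd.read_parquet("docs.parquet")
restored["_id"] = restored["_id"].map(ObjectId)   # back to ObjectId instances
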