Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
32 votes · 6 answers

How do you control the size of the output file?

In Spark, what is the best way to control the file size of the output file? For example, in log4j we can specify a max file size, after which the file rotates. I am looking for a similar solution for Parquet files. Is there a max file size option…
user447359 (477 rep · 1 gold, 5 silver, 13 bronze)
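For the size question above, the usual workaround is to cap rows per file rather than bytes. A minimal PySpark sketch, assuming Spark 2.2+ (where the maxRecordsPerFile write option is available) and illustrative paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input path

# Rotate to a new Parquet file after N rows; 1_000_000 is an arbitrary example value.
(df.write
   .option("maxRecordsPerFile", 1_000_000)
   .mode("overwrite")
   .parquet("/data/output"))
```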
32 votes · 6 answers

Reading parquet files from multiple directories in Pyspark

I need to read parquet files from multiple paths that are not parent or child directories. For example, dir1 contains dir1_1 and dir1_2, while dir2 contains dir2_1 and …
joshsuihn (770 rep · 1 gold, 10 silver, 25 bronze)
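For the multi-directory question, DataFrameReader.parquet is variadic, so unrelated paths can be passed in one call. A sketch with hypothetical sibling directories:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Paths that are neither parents nor children of each other (names are illustrative).
paths = ["/data/dir1/dir1_2", "/data/dir2/dir2_1"]

# spark.read.parquet accepts any number of paths and unions them into one DataFrame.
df = spark.read.parquet(*paths)
```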
31 votes · 2 answers

How to handle null values when writing to parquet from Spark

Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md However it will be a long time before spark…
WestCoastProjects (58,982 rep · 91 gold, 316 silver, 560 bronze)
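One workaround commonly mentioned for the null-handling question is to substitute defaults before writing. A hedged PySpark sketch; the column names and fill values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

# Replace nulls with per-column defaults so the written Parquet never contains missing values.
cleaned = df.na.fill({"category": "unknown", "count": 0})
cleaned.write.mode("overwrite").parquet("/data/no_nulls")
```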
31 votes · 5 answers

Spark : Read file only if the path exists

I am trying to read the files present at a sequence of paths in Scala. Below is the sample (pseudo) code: val paths = Seq[String] // Seq of paths; val dataframe = spark.read.parquet(paths: _*). Now, in the above sequence, some paths exist whereas some…
Darshan Mehta (30,102 rep · 11 gold, 68 silver, 102 bronze)
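The question is in Scala, but the same idea translates to PySpark: filter the path list through the Hadoop FileSystem API before calling the reader. A sketch that relies on Spark's internal JVM gateway attributes (_jsc, _jvm) and made-up paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
candidate_paths = ["/data/2024-01-01", "/data/2024-01-02"]  # some may not exist

# Check existence with the Hadoop FileSystem API so spark.read.parquet
# is never handed a missing path.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

existing = [p for p in candidate_paths if fs.exists(Path(p))]
df = spark.read.parquet(*existing) if existing else None
```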
31 votes · 6 answers

Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libs. Is it possible to use Parquet outside of HDFS? Or what is the minimal dependency?
capacman (317 rep · 1 gold, 4 silver, 7 bronze)
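For the no-Hadoop question, pyarrow is one library that reads and writes Parquet on a plain local filesystem with no HDFS dependency. A small sketch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table in memory and round-trip it through a local Parquet file.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "local.parquet")   # plain local file, no Hadoop involved

roundtrip = pq.read_table("local.parquet")
print(roundtrip.to_pandas())
```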
30 votes · 6 answers

Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read these parquet files starting from file1 in order and…
Pri31 (447 rep · 1 gold, 5 silver, 9 bronze)
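A pandas-only sketch of the folder-to-CSV task, assuming a hypothetical data/ folder with files named par_file1 … par_fileN:

```python
import glob
import pandas as pd

# sorted() gives a stable order; note that lexicographic order only matches numeric
# order if the file names are zero-padded (par_file001, par_file002, ...).
files = sorted(glob.glob("data/par_file*"))

combined = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
combined.to_csv("combined.csv", index=False)
```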
30 votes · 4 answers

Multiple spark jobs appending parquet data to same base path with partitioning

I have multiple jobs that I want to execute in parallel that append daily data into the same path using partitioning, e.g. dataFrame.write().partitionBy("eventDate", "category").mode(Append)…
vcetinick (1,957 rep · 1 gold, 19 silver, 41 bronze)
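The snippet in the question above presumably corresponds to a partitioned append like the PySpark sketch below (paths are illustrative); the harder part of the question, avoiding clashes between concurrent jobs in the shared _temporary directory, is not addressed here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
daily_df = spark.read.parquet("/staging/2024-01-01")  # hypothetical daily input

# Each job appends only its own partitions under the shared base path.
(daily_df.write
    .partitionBy("eventDate", "category")
    .mode("append")
    .parquet("/data/events"))
```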
29 votes · 3 answers

Spark save(write) parquet only one file

If I write dataFrame.write.format("parquet").mode("append").save("temp.parquet"), the temp.parquet folder gets the same number of files as the number of rows. I think I don't fully understand Parquet, but is that natural?
Easyhyum (307 rep · 1 gold, 3 silver, 5 bronze)
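A common answer to the single-file question is to collapse the DataFrame to one partition before writing, since each partition becomes one output file. A sketch with illustrative paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

# One partition -> one part-*.parquet file (fine for small data, a bottleneck for large data).
df.coalesce(1).write.mode("overwrite").parquet("/data/single_file_output")
```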
28 votes · 3 answers

How to append data to an existing parquet file

I'm using the following code to create a ParquetWriter and to write records to it: ParquetWriter parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE); final GenericRecord record =…
Devas (1,544 rep · 4 gold, 23 silver, 28 bronze)
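The question uses the Java ParquetWriter; as a Python analogue, pyarrow's ParquetWriter illustrates the same constraint: a closed Parquet file cannot be appended to, so the usual pattern is to keep one writer open and emit several row groups (or write additional files into the same directory). The file name and batch contents below are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.string())])

with pq.ParquetWriter("records.parquet", schema) as writer:
    for batch_no in range(3):  # three illustrative batches, each written as a row group
        table = pa.table({"id": [batch_no], "value": [f"row-{batch_no}"]}, schema=schema)
        writer.write_table(table)
```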
27 votes · 4 answers

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that using only pyarrow. Here's my attempt: import pyarrow.parquet as pq import s3fs fs =…
kluu (2,848 rep · 3 gold, 15 silver, 35 bronze)
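A sketch of predicate filtering with ParquetDataset, mirroring the question's s3fs setup; the bucket, prefix, and column names are made up, and depending on the pyarrow version the filters act at partition or row-group level:

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

dataset = pq.ParquetDataset(
    "my-bucket/dataset/",   # illustrative bucket/prefix
    filesystem=fs,
    filters=[("event_date", "=", "2024-01-01"), ("score", ">", 10)],
)
df = dataset.read().to_pandas()
```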
27 votes · 2 answers

How to identify Pandas' backend for Parquet

I understand that Pandas can read and write Parquet files using different backends: pyarrow and fastparquet. I have a Conda installation with the Intel distribution and "it works": I can use pandas.DataFrame.to_parquet. However I do not…
Cedric H. (7,980 rep · 10 gold, 55 silver, 82 bronze)
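One way to see which backend pandas will use is to inspect the io.parquet.engine option and check which engines are importable; a small sketch:

```python
import pandas as pd

# "auto" (the default) tries pyarrow first, then fastparquet.
print(pd.get_option("io.parquet.engine"))

for engine in ("pyarrow", "fastparquet"):
    try:
        __import__(engine)
        print(engine, "is installed")
    except ImportError:
        print(engine, "is not installed")
```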
27 votes · 6 answers

Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Neither the developer documentation nor the API documentation includes any reference to what options can be passed to DataFrame.saveAsTable or DataFrameWriter.options, or how they would affect the saving of a Hive table. My hope is that in the answers to this…
Sim (13,147 rep · 9 gold, 66 silver, 95 bronze)
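A hedged PySpark sketch of two options that the built-in parquet source accepts with saveAsTable, a compression codec and an explicit table path; the database, table name, and paths are illustrative, and other options depend on the data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

(df.write
   .format("parquet")
   .option("compression", "snappy")          # per-write codec
   .option("path", "/warehouse/my_table")    # makes the saved table external at this location
   .saveAsTable("my_db.my_table"))
```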
26 votes · 3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates because some column types do not match and I get one of…
V. Samma (2,558 rep · 8 gold, 30 silver, 34 bronze)
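For schema drift across daily chunks, Spark's mergeSchema read option reconciles columns that were added or removed over time; it does not resolve genuinely conflicting types, which usually need an explicit cast per path before a union. A sketch using the path pattern from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Merge the Parquet schemas of all daily chunks under the prefix into one DataFrame schema.
df = (spark.read
          .option("mergeSchema", "true")
          .parquet("s3://bucketName/prefix/"))
```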
26 votes · 2 answers

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, with folder structure events_v3/event_date=2015-01-01/event_hour=2015-01-1/part10000.parquet.gz and events_v3/event_date=2015-01-02/event_hour=5/part10000.parquet.gz. I have…
Gaurav Shah (5,223 rep · 7 gold, 43 silver, 71 bronze)
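To avoid scanning every leaf directory, one common approach is to point the reader at the specific partition while declaring basePath so the partition columns are still recovered. A sketch using the layout from the question (the local base path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keeps event_date / event_hour as columns without discovering every partition.
df = (spark.read
          .option("basePath", "/data/events_v3")
          .parquet("/data/events_v3/event_date=2015-01-01"))
```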
26 votes · 4 answers

Can we load Parquet file into Hive directly?

I know we can load a parquet file using Spark SQL and Impala, but I am wondering if we can do the same using Hive. I have been reading many articles but I am still confused. Simply put, I have a parquet file - say users.parquet. Now I am stuck here…
annunarcist (1,637 rep · 3 gold, 20 silver, 42 bronze)
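Hive can point an external table directly at a directory of Parquet files. A sketch issuing the HiveQL through PySpark; the database, table, columns, and location are illustrative, and the column list must match the Parquet schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register an external Hive table over an existing Parquet directory, then query it.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.users (
        id BIGINT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/users_parquet/'
""")
spark.sql("SELECT * FROM analytics.users LIMIT 5").show()
```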