Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
32 votes · 6 answers

How do you control the size of the output file?

In Spark, what is the best way to control the file size of the output file? For example, in log4j we can specify a max file size, after which the file rotates. I am looking for a similar solution for Parquet files. Is there a max file size option…
user447359 (477 rep · 1 gold, 5 silver, 13 bronze)
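For the size question above, the usual workaround is to cap rows per file rather than bytes. A minimal PySpark sketch, assuming Spark 2.2+ (where the maxRecordsPerFile write option is available) and illustrative paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input path

# Rotate to a new Parquet file after N rows; 1_000_000 is an arbitrary example value.
(df.write
   .option("maxRecordsPerFile", 1_000_000)
   .mode("overwrite")
   .parquet("/data/output"))
```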
32 votes · 6 answers

Reading parquet files from multiple directories in Pyspark

I need to read parquet files from multiple paths that are not parent or child directories. For example, dir1 contains dir1_1 and dir1_2, while dir2 contains dir2_1 and …
joshsuihn (770 rep · 1 gold, 10 silver, 25 bronze)
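For the multi-directory question, DataFrameReader.parquet is variadic, so unrelated paths can be passed in one call. A sketch with hypothetical sibling directories:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Paths that are neither parents nor children of each other (names are illustrative).
paths = ["/data/dir1/dir1_2", "/data/dir2/dir2_1"]

# spark.read.parquet accepts any number of paths and unions them into one DataFrame.
df = spark.read.parquet(*paths)
```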
31 votes · 2 answers

How to handle null values when writing to parquet from Spark

Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md However it will be a long time before spark…
WestCoastProjects (58,982 rep · 91 gold, 316 silver, 560 bronze)
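One workaround commonly mentioned for the null-handling question is to substitute defaults before writing. A hedged PySpark sketch; the column names and fill values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

# Replace nulls with per-column defaults so the written Parquet never contains missing values.
cleaned = df.na.fill({"category": "unknown", "count": 0})
cleaned.write.mode("overwrite").parquet("/data/no_nulls")
```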
31 votes · 5 answers

Spark : Read file only if the path exists

I am trying to read the files present at a sequence of paths in Scala. Below is the sample (pseudo) code: val paths = Seq[String] // Seq of paths; val dataframe = spark.read.parquet(paths: _*). Now, in the above sequence, some paths exist whereas some…
Darshan Mehta (30,102 rep · 11 gold, 68 silver, 102 bronze)
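The question is in Scala, but the same idea translates to PySpark: filter the path list through the Hadoop FileSystem API before calling the reader. A sketch that relies on Spark's internal JVM gateway attributes (_jsc, _jvm) and made-up paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
candidate_paths = ["/data/2024-01-01", "/data/2024-01-02"]  # some may not exist

# Check existence with the Hadoop FileSystem API so spark.read.parquet
# is never handed a missing path.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

existing = [p for p in candidate_paths if fs.exists(Path(p))]
df = spark.read.parquet(*existing) if existing else None
```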
31 votes · 6 answers

Parquet without Hadoop?

I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libs. Is it possible to use Parquet outside of HDFS? Or what is the minimal dependency?
capacman (317 rep · 1 gold, 4 silver, 7 bronze)
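For the no-Hadoop question, pyarrow is one library that reads and writes Parquet on a plain local filesystem with no HDFS dependency. A small sketch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table in memory and round-trip it through a local Parquet file.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "local.parquet")   # plain local file, no Hadoop involved

roundtrip = pq.read_table("local.parquet")
print(roundtrip.to_pandas())
```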
30 votes · 6 answers

Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read these parquet files starting from file1 in order and…
Pri31 (447 rep · 1 gold, 5 silver, 9 bronze)
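A pandas-only sketch of the folder-to-CSV task, assuming a hypothetical data/ folder with files named par_file1 … par_fileN:

```python
import glob
import pandas as pd

# sorted() gives a stable order; note that lexicographic order only matches numeric
# order if the file names are zero-padded (par_file001, par_file002, ...).
files = sorted(glob.glob("data/par_file*"))

combined = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
combined.to_csv("combined.csv", index=False)
```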
30 votes · 4 answers

Multiple spark jobs appending parquet data to same base path with partitioning

I have multiple jobs that I want to execute in parallel that append daily data into the same path using partitioning, e.g. dataFrame.write().partitionBy("eventDate", "category").mode(Append)…
vcetinick (1,957 rep · 1 gold, 19 silver, 41 bronze)
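The snippet in the question above presumably corresponds to a partitioned append like the PySpark sketch below (paths are illustrative); the harder part of the question, avoiding clashes between concurrent jobs in the shared _temporary directory, is not addressed here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
daily_df = spark.read.parquet("/staging/2024-01-01")  # hypothetical daily input

# Each job appends only its own partitions under the shared base path.
(daily_df.write
    .partitionBy("eventDate", "category")
    .mode("append")
    .parquet("/data/events"))
```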
29 votes · 3 answers

Spark save(write) parquet only one file

If I write dataFrame.write.format("parquet").mode("append").save("temp.parquet"), the temp.parquet folder gets the same number of files as the number of rows. I think I don't fully understand Parquet, but is that natural?
Easyhyum (307 rep · 1 gold, 3 silver, 5 bronze)
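A common answer to the single-file question is to collapse the DataFrame to one partition before writing, since each partition becomes one output file. A sketch with illustrative paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

# One partition -> one part-*.parquet file (fine for small data, a bottleneck for large data).
df.coalesce(1).write.mode("overwrite").parquet("/data/single_file_output")
```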
28 votes · 3 answers

How to append data to an existing parquet file

I'm using the following code to create a ParquetWriter and to write records to it: ParquetWriter parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE); final GenericRecord record =…
Devas (1,544 rep · 4 gold, 23 silver, 28 bronze)
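The question uses the Java ParquetWriter; as a Python analogue, pyarrow's ParquetWriter illustrates the same constraint: a closed Parquet file cannot be appended to, so the usual pattern is to keep one writer open and emit several row groups (or write additional files into the same directory). The file name and batch contents below are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.string())])

with pq.ParquetWriter("records.parquet", schema) as writer:
    for batch_no in range(3):  # three illustrative batches, each written as a row group
        table = pa.table({"id": [batch_no], "value": [f"row-{batch_no}"]}, schema=schema)
        writer.write_table(table)
```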
27 votes · 4 answers

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

I have a parquet dataset stored on s3, and I would like to query specific rows from the dataset. I was able to do that using petastorm but now I want to do that using only pyarrow. Here's my attempt: import pyarrow.parquet as pq import s3fs fs =…
kluu (2,848 rep · 3 gold, 15 silver, 35 bronze)
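A sketch of predicate filtering with ParquetDataset, mirroring the question's s3fs setup; the bucket, prefix, and column names are made up, and depending on the pyarrow version the filters act at partition or row-group level:

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

dataset = pq.ParquetDataset(
    "my-bucket/dataset/",   # illustrative bucket/prefix
    filesystem=fs,
    filters=[("event_date", "=", "2024-01-01"), ("score", ">", 10)],
)
df = dataset.read().to_pandas()
```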
27 votes · 2 answers

How to identify Pandas' backend for Parquet

I understand that Pandas can read and write Parquet files using different backends: pyarrow and fastparquet. I have a Conda installation with the Intel distribution and "it works": I can use pandas.DataFrame.to_parquet. However I do not…
Cedric H. (7,980 rep · 10 gold, 55 silver, 82 bronze)
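One way to see which backend pandas will use is to inspect the io.parquet.engine option and check which engines are importable; a small sketch:

```python
import pandas as pd

# "auto" (the default) tries pyarrow first, then fastparquet.
print(pd.get_option("io.parquet.engine"))

for engine in ("pyarrow", "fastparquet"):
    try:
        __import__(engine)
        print(engine, "is installed")
    except ImportError:
        print(engine, "is not installed")
```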
27 votes · 6 answers

Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Neither the developer documentation nor the API documentation includes any reference to what options can be passed to DataFrame.saveAsTable or DataFrameWriter.options, or how they would affect the saving of a Hive table. My hope is that in the answers to this…
Sim (13,147 rep · 9 gold, 66 silver, 95 bronze)
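A hedged PySpark sketch of two options that the built-in parquet source accepts with saveAsTable, a compression codec and an explicit table path; the database, table name, and paths are illustrative, and other options depend on the data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input

(df.write
   .format("parquet")
   .option("compression", "snappy")          # per-write codec
   .option("path", "/warehouse/my_table")    # makes the saved table external at this location
   .saveAsTable("my_db.my_table"))
```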
26 votes · 3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/) but I cannot read the data in AWS EMR Spark from different dates because some column types do not match and I get one of…
V. Samma (2,558 rep · 8 gold, 30 silver, 34 bronze)
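For schema drift across daily chunks, Spark's mergeSchema read option reconciles columns that were added or removed over time; it does not resolve genuinely conflicting types, which usually need an explicit cast per path before a union. A sketch using the path pattern from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Merge the Parquet schemas of all daily chunks under the prefix into one DataFrame schema.
df = (spark.read
          .option("mergeSchema", "true")
          .parquet("s3://bucketName/prefix/"))
```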
26 votes · 2 answers

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, with folder structure events_v3/event_date=2015-01-01/event_hour=2015-01-1/part10000.parquet.gz and events_v3/event_date=2015-01-02/event_hour=5/part10000.parquet.gz. I have…
Gaurav Shah (5,223 rep · 7 gold, 43 silver, 71 bronze)
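To avoid scanning every leaf directory, one common approach is to point the reader at the specific partition while declaring basePath so the partition columns are still recovered. A sketch using the layout from the question (the local base path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keeps event_date / event_hour as columns without discovering every partition.
df = (spark.read
          .option("basePath", "/data/events_v3")
          .parquet("/data/events_v3/event_date=2015-01-01"))
```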
26 votes · 4 answers

Can we load Parquet file into Hive directly?

I know we can load a parquet file using Spark SQL and Impala, but I am wondering if we can do the same using Hive. I have been reading many articles but I am still confused. Simply put, I have a parquet file - say users.parquet. Now I am stuck here…
annunarcist (1,637 rep · 3 gold, 20 silver, 42 bronze)
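Hive can point an external table directly at a directory of Parquet files. A sketch issuing the HiveQL through PySpark; the database, table, columns, and location are illustrative, and the column list must match the Parquet schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register an external Hive table over an existing Parquet directory, then query it.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.users (
        id BIGINT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION '/data/users_parquet/'
""")
spark.sql("SELECT * FROM analytics.users LIMIT 5").show()
```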