Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
20 votes, 1 answer

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data. I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is a file format and not a database, unlike HBase. My…
sovan
  • 363
  • 1
  • 4
  • 13
20 votes, 1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse
  • 600
  • 3
  • 20
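
A minimal sketch of one way to approach the question above with pyarrow (one of the two libraries mentioned); column names, sample values, and the output path are placeholders, and the tags column is written as an array of key/value structs that engines such as Athena can read as array<struct<...>>:

    # Sketch only: write a column whose values are arrays of key/value structs.
    import pyarrow as pa
    import pyarrow.parquet as pq

    struct_type = pa.struct([("key", pa.string()), ("value", pa.string())])
    table = pa.table({
        "id": pa.array([1, 2], type=pa.int64()),
        "tags": pa.array(
            [[{"key": "color", "value": "red"}],
             [{"key": "size", "value": "L"}, {"key": "color", "value": "blue"}]],
            type=pa.list_(struct_type),
        ),
    })
    pq.write_table(table, "records.parquet")  # placeholder output path
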
20 votes, 2 answers

create parquet files in java

Is there a way to create parquet files from Java? I have data in memory (Java classes) and I want to write it into a parquet file, to later read it from apache-drill. Is there a simple way to do this, like inserting data into a SQL table? GOT…
Imbar M.
  • 1,074
  • 1
  • 10
  • 19
20 votes, 1 answer

Can parquet support concurrent write operations?

Is it possible to perform distributed concurrent writes to the parquet format? And is it possible to read parquet files while they are being written? If there are methods for concurrent reads/writes, I'd be interested to learn about them.
Loic
  • 1,088
  • 7
  • 19
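
Individual Parquet files are immutable once their footer is written, so concurrency is normally handled at the directory or table level rather than inside a single file. A minimal PySpark sketch of that pattern, with the path, sample data, and SparkSession setup as assumptions:

    # Sketch: each Spark task writes its own part file in parallel, and
    # "append" mode adds new part files next to the existing ones.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-concurrency-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    df.write.mode("append").parquet("/tmp/events_parquet")  # placeholder path

    # Readers see whichever complete files exist when they list the directory.
    spark.read.parquet("/tmp/events_parquet").show()
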
19 votes, 5 answers

How to convert a JSON result to Parquet in python?

I am following the script below to convert a JSON file to parquet format, using the pandas library to perform the conversion. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to…
Mateus Silvestre
  • 191
  • 1
  • 1
  • 3
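
A minimal pandas sketch, assuming pyarrow is installed and the file names are placeholders; the AttributeError usually means a plain DataFrame was handed to a writer that expects an Arrow Table (which does have a schema attribute), whereas to_parquet does the conversion itself:

    # Sketch: let pandas drive the Parquet conversion end to end.
    import pandas as pd

    df = pd.read_json("input.json")   # use lines=True for one JSON object per line
    df.to_parquet("output.parquet", engine="pyarrow", index=False)
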
19 votes, 3 answers

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See…
Adam
  • 313
  • 1
  • 3
  • 11
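
A minimal PySpark sketch of the round trip in question; the path, column names, and sample data are assumptions. The number of in-memory partitions after the read is generally driven by file sizes and spark.sql.files.maxPartitionBytes rather than by the repartition() used at write time, so it is worth checking explicitly:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
    df = spark.range(1000).withColumn("bucket", col("id") % 10)

    # Write with an explicit shuffle plus directory partitioning.
    (df.repartition("bucket")
       .write.mode("overwrite")
       .partitionBy("bucket")
       .parquet("/tmp/bucketed_parquet"))       # placeholder path

    # Read back and inspect how many partitions Spark actually created.
    read_back = spark.read.parquet("/tmp/bucketed_parquet")
    print(read_back.rdd.getNumPartitions())
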
19 votes, 3 answers

Nested data in Parquet with Python

I have a file that has one JSON per line. Here is a sample: { "product": { "id": "abcdef", "price": 19.99, "specs": { "voltage": "110v", "color": "white" } }, "user": "Daniel…
Daniel Severo
  • 1,768
  • 2
  • 15
  • 22
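
A minimal sketch using pyarrow's JSON reader, which keeps nested objects as struct columns when writing Parquet; the file names are placeholders, and read_json expects newline-delimited JSON like the sample above:

    import pyarrow.json as pa_json
    import pyarrow.parquet as pq

    # Each input line is one JSON object; nested objects become struct columns.
    table = pa_json.read_json("events.jsonl")    # placeholder path
    pq.write_table(table, "events.parquet")
    print(table.schema)                          # product ends up as a struct of id, price, specs
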
19 votes, 3 answers

Dremel - repetition and definition level

Reading the Interactive Analysis of Web-Scale Datasets paper, I bumped into the concepts of repetition and definition levels. While I understand the need for these two to be able to disambiguate occurrences, it attaches a repetition and definition level…
Tony Tannous
  • 14,154
  • 10
  • 50
  • 86
19 votes, 2 answers

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled), and we realized that Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower than on HDFS, but we found a workaround of using the…
anivohra
  • 221
  • 2
  • 8
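
One commonly cited mitigation once DirectParquetOutputCommitter is gone is switching the Hadoop FileOutputCommitter to its v2 algorithm, which moves task output to the final location at task commit instead of in a serial job-commit rename pass. A hedged PySpark configuration sketch (bucket and app names are placeholders; the S3-optimized committers that newer EMR releases ship are a separate, vendor-provided option):

    from pyspark.sql import SparkSession

    # Sketch: enable the v2 commit algorithm for file outputs written by Spark.
    spark = (SparkSession.builder
             .appName("s3-committer-sketch")
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())

    spark.range(10).write.mode("overwrite").parquet("s3://my-bucket/tmp/parquet_out")

Like the old direct committer, the v2 algorithm trades some failure-handling guarantees for speed, so it is worth validating against the job's retry behaviour.
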
19 votes, 3 answers

how to merge multiple parquet files to single parquet file using linux or hdfs command?

I have multiple small parquet files generated as the output of a Hive QL job, and I would like to merge them into a single parquet file. What is the best way to do it using HDFS or Linux commands? We used to merge text files using cat…
Shankar
  • 8,529
  • 26
  • 90
  • 159
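
Parquet files carry their schema and row-group metadata in a footer, so concatenating them with cat does not produce a valid file; the usual workaround is to rewrite the data, for example with a small PySpark job (paths are placeholders; some parquet-tools builds also offer a footer-aware merge command):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-parquet").getOrCreate()

    # Read every small file under the job output and rewrite it as one part file.
    (spark.read.parquet("hdfs:///warehouse/job_output/")
          .coalesce(1)
          .write.mode("overwrite")
          .parquet("hdfs:///warehouse/job_output_merged/"))   # placeholder paths
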
19 votes, 3 answers

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200,000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark…
Euan
  • 559
  • 4
  • 10
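
A minimal PySpark sketch for checking whether pruning actually happens on a plant_name/tag_id layout; the path and the literal value are placeholders. A filter on the partition column should surface under PartitionFilters in the physical plan, meaning only the matching tag_id directories are scanned:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

    df = spark.read.parquet("/data/readings")        # directory partitioned by plant_name/tag_id
    hits = df.filter(col("tag_id") == "1000")        # literal predicate on the partition column

    hits.explain(True)   # look for the predicate under PartitionFilters in the scan node
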
19 votes, 2 answers

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV so I read them and apply a schema, then perform my transformations. My problem is, how…
Saman
  • 541
  • 2
  • 5
  • 11
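
A minimal PySpark sketch of the usual pattern for this kind of hourly ETL: derive date and hour columns, partition by them, and write each batch in append mode so a new run only adds files under its own partition directories. The input path, header option, and event_time column are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, hour

    spark = SparkSession.builder.appName("hourly-etl-sketch").getOrCreate()

    logs = spark.read.option("header", True).csv("/raw/logs/2024-01-01-03.csv")  # placeholder input
    logs = logs.withColumn("date", to_date("event_time")).withColumn("hour", hour("event_time"))

    (logs.write
         .mode("append")                        # each run only adds new files
         .partitionBy("date", "hour")
         .parquet("/warehouse/logs_parquet"))   # placeholder output path
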
19 votes, 4 answers

How to read a nested collection in Spark

I have a parquet table where one of the columns is an array of structs. I can run queries against this table in Hive using LATERAL VIEW syntax. How do I read this table into an RDD, and more importantly, how do I filter, map, etc. this…
Tagar
  • 13,911
  • 6
  • 95
  • 110
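
A minimal PySpark sketch, assuming the column in question is an array of structs (which the LATERAL VIEW usage suggests); the table, column, and field names are placeholders. explode() plays the role of LATERAL VIEW, and the result can be filtered and mapped as a DataFrame or dropped down to an RDD:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = (SparkSession.builder
             .appName("nested-read-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("my_parquet_table")                         # placeholder Hive table
    exploded = df.select("id", explode("items").alias("item"))   # items: array of structs

    filtered = exploded.filter(col("item.qty") > 1)              # reach into the struct fields
    rdd = filtered.rdd                                           # plain RDD of Rows if needed
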
19 votes, 5 answers

How to split parquet files into many partitions in Spark?

So I have just one parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the parquet to none…
samthebest
  • 30,803
  • 25
  • 102
  • 142
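
A minimal PySpark sketch; the path is a placeholder. Setting spark.default.parallelism alone typically does not change how a single file is split, so the usual approach is an explicit repartition() after the read (or lowering spark.sql.files.maxPartitionBytes so the scan itself produces more splits):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-sketch").getOrCreate()

    df = spark.read.parquet("/data/one_big_file.parquet")   # placeholder path
    df = df.repartition(100)                                 # explicit shuffle into 100 partitions
    print(df.rdd.getNumPartitions())                         # should print 100
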
18 votes, 1 answer

What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?

I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, you only need to do ls on the root folder of the table (given the table is partitioned by only one column), and get all its partitions,…
gdoron
  • 147,333
  • 58
  • 291
  • 367
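
MSCK REPAIR TABLE has to list the table root recursively and reconcile every partition directory it finds with the metastore, which is what makes it slow on large or S3-backed tables. When only a few partitions are new, a common workaround is to register them explicitly; a hedged Spark SQL sketch with database, table, column, and location names as placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("add-partition-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Register one known-new partition instead of rescanning the whole table root.
    spark.sql("""
        ALTER TABLE my_db.events ADD IF NOT EXISTS
        PARTITION (dt='2024-01-01')
        LOCATION 's3://my-bucket/events/dt=2024-01-01/'
    """)
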