Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
20 votes, 1 answer

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data. I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is a file format and not a database, unlike HBase. My…
sovan
  • 363
  • 1
  • 4
  • 13
20 votes, 1 answer

Does any Python library support writing arrays of structs to Parquet files?

I want to write data where some columns are arrays of strings or arrays of structs (typically key-value pairs) into a Parquet file for use in AWS Athena. After finding two Python libraries (Arrow and fastparquet) supporting writing to Parquet files…
moonhouse
  • 600
  • 3
  • 20
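
A minimal sketch of one way to approach the question above with pyarrow (one of the two libraries mentioned); column names, sample values, and the output path are placeholders, and the tags column is written as an array of key/value structs that engines such as Athena can read as array<struct<...>>:

    # Sketch only: write a column whose values are arrays of key/value structs.
    import pyarrow as pa
    import pyarrow.parquet as pq

    struct_type = pa.struct([("key", pa.string()), ("value", pa.string())])
    table = pa.table({
        "id": pa.array([1, 2], type=pa.int64()),
        "tags": pa.array(
            [[{"key": "color", "value": "red"}],
             [{"key": "size", "value": "L"}, {"key": "color", "value": "blue"}]],
            type=pa.list_(struct_type),
        ),
    })
    pq.write_table(table, "records.parquet")  # placeholder output path
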
20 votes, 2 answers

create parquet files in java

Is there a way to create parquet files from Java? I have data in memory (Java classes) and I want to write it into a parquet file, to later read it from apache-drill. Is there a simple way to do this, like inserting data into a SQL table? GOT…
Imbar M.
  • 1,074
  • 1
  • 10
  • 19
20 votes, 1 answer

Can parquet support concurrent write operations?

Is it possible to perform distributed concurrent writes to the parquet format? And is it possible to read parquet files while they are being written? If there are methods for concurrent reads/writes, I'd be interested to learn about them.
Loic
  • 1,088
  • 7
  • 19
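
Individual Parquet files are immutable once their footer is written, so concurrency is normally handled at the directory or table level rather than inside a single file. A minimal PySpark sketch of that pattern, with the path, sample data, and SparkSession setup as assumptions:

    # Sketch: each Spark task writes its own part file in parallel, and
    # "append" mode adds new part files next to the existing ones.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-concurrency-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    df.write.mode("append").parquet("/tmp/events_parquet")  # placeholder path

    # Readers see whichever complete files exist when they list the directory.
    spark.read.parquet("/tmp/events_parquet").show()
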
19 votes, 5 answers

How to convert a JSON result to Parquet in python?

I am following the script below to convert a JSON file to parquet format, using the pandas library to perform the conversion. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to…
Mateus Silvestre
  • 191
  • 1
  • 1
  • 3
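
A minimal pandas sketch, assuming pyarrow is installed and the file names are placeholders; the AttributeError usually means a plain DataFrame was handed to a writer that expects an Arrow Table (which does have a schema attribute), whereas to_parquet does the conversion itself:

    # Sketch: let pandas drive the Parquet conversion end to end.
    import pandas as pd

    df = pd.read_json("input.json")   # use lines=True for one JSON object per line
    df.to_parquet("output.parquet", engine="pyarrow", index=False)
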
19 votes, 3 answers

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See…
Adam
  • 313
  • 1
  • 3
  • 11
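
A minimal PySpark sketch of the round trip in question; the path, column names, and sample data are assumptions. The number of in-memory partitions after the read is generally driven by file sizes and spark.sql.files.maxPartitionBytes rather than by the repartition() used at write time, so it is worth checking explicitly:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
    df = spark.range(1000).withColumn("bucket", col("id") % 10)

    # Write with an explicit shuffle plus directory partitioning.
    (df.repartition("bucket")
       .write.mode("overwrite")
       .partitionBy("bucket")
       .parquet("/tmp/bucketed_parquet"))       # placeholder path

    # Read back and inspect how many partitions Spark actually created.
    read_back = spark.read.parquet("/tmp/bucketed_parquet")
    print(read_back.rdd.getNumPartitions())
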
19 votes, 3 answers

Nested data in Parquet with Python

I have a file that has one JSON per line. Here is a sample: { "product": { "id": "abcdef", "price": 19.99, "specs": { "voltage": "110v", "color": "white" } }, "user": "Daniel…
Daniel Severo
  • 1,768
  • 2
  • 15
  • 22
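
A minimal sketch using pyarrow's JSON reader, which keeps nested objects as struct columns when writing Parquet; the file names are placeholders, and read_json expects newline-delimited JSON like the sample above:

    import pyarrow.json as pa_json
    import pyarrow.parquet as pq

    # Each input line is one JSON object; nested objects become struct columns.
    table = pa_json.read_json("events.jsonl")    # placeholder path
    pq.write_table(table, "events.parquet")
    print(table.schema)                          # product ends up as a struct of id, price, specs
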
19 votes, 3 answers

Dremel - repetition and definition level

Reading the Interactive Analysis of Web-Scale Datasets paper, I bumped into the concepts of repetition and definition levels. While I understand the need for these two to be able to disambiguate occurrences, it attaches a repetition and definition level…
Tony Tannous
  • 14,154
  • 10
  • 50
  • 86
19 votes, 2 answers

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled), and we realized that Spark 'SaveAsTable' (parquet format) writes to S3 were ~4x slower than on HDFS, but we found a workaround of using the…
anivohra
  • 221
  • 2
  • 8
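
One commonly cited mitigation once DirectParquetOutputCommitter is gone is switching the Hadoop FileOutputCommitter to its v2 algorithm, which moves task output to the final location at task commit instead of in a serial job-commit rename pass. A hedged PySpark configuration sketch (bucket and app names are placeholders; the S3-optimized committers that newer EMR releases ship are a separate, vendor-provided option):

    from pyspark.sql import SparkSession

    # Sketch: enable the v2 commit algorithm for file outputs written by Spark.
    spark = (SparkSession.builder
             .appName("s3-committer-sketch")
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())

    spark.range(10).write.mode("overwrite").parquet("s3://my-bucket/tmp/parquet_out")

Like the old direct committer, the v2 algorithm trades some failure-handling guarantees for speed, so it is worth validating against the job's retry behaviour.
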
19 votes, 3 answers

how to merge multiple parquet files to single parquet file using linux or hdfs command?

I have multiple small parquet files generated as the output of a Hive QL job, and I would like to merge them into a single parquet file. What is the best way to do it using HDFS or Linux commands? We used to merge text files using cat…
Shankar
  • 8,529
  • 26
  • 90
  • 159
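
Parquet files carry their schema and row-group metadata in a footer, so concatenating them with cat does not produce a valid file; the usual workaround is to rewrite the data, for example with a small PySpark job (paths are placeholders; some parquet-tools builds also offer a footer-aware merge command):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-parquet").getOrCreate()

    # Read every small file under the job output and rewrite it as one part file.
    (spark.read.parquet("hdfs:///warehouse/job_output/")
          .coalesce(1)
          .write.mode("overwrite")
          .parquet("hdfs:///warehouse/job_output_merged/"))   # placeholder paths
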
19 votes, 3 answers

Does Spark support Partition Pruning with Parquet Files

I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200,000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark…
Euan
  • 559
  • 4
  • 10
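
A minimal PySpark sketch for checking whether pruning actually happens on a plant_name/tag_id layout; the path and the literal value are placeholders. A filter on the partition column should surface under PartitionFilters in the physical plan, meaning only the matching tag_id directories are scanned:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

    df = spark.read.parquet("/data/readings")        # directory partitioned by plant_name/tag_id
    hits = df.filter(col("tag_id") == "1000")        # literal predicate on the partition column

    hits.explain(True)   # look for the predicate under PartitionFilters in the scan node
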
19 votes, 2 answers

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV so I read them and apply a schema, then perform my transformations. My problem is, how…
Saman
  • 541
  • 2
  • 5
  • 11
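
A minimal PySpark sketch of the usual pattern for this kind of hourly ETL: derive date and hour columns, partition by them, and write each batch in append mode so a new run only adds files under its own partition directories. The input path, header option, and event_time column are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, hour

    spark = SparkSession.builder.appName("hourly-etl-sketch").getOrCreate()

    logs = spark.read.option("header", True).csv("/raw/logs/2024-01-01-03.csv")  # placeholder input
    logs = logs.withColumn("date", to_date("event_time")).withColumn("hour", hour("event_time"))

    (logs.write
         .mode("append")                        # each run only adds new files
         .partitionBy("date", "hour")
         .parquet("/warehouse/logs_parquet"))   # placeholder output path
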
19 votes, 4 answers

How to read a nested collection in Spark

I have a parquet table where one of the columns is an array of structs. I can run queries against this table in Hive using LATERAL VIEW syntax. How do I read this table into an RDD, and more importantly, how do I filter, map, etc. this…
Tagar
  • 13,911
  • 6
  • 95
  • 110
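
A minimal PySpark sketch, assuming the column in question is an array of structs (which the LATERAL VIEW usage suggests); the table, column, and field names are placeholders. explode() plays the role of LATERAL VIEW, and the result can be filtered and mapped as a DataFrame or dropped down to an RDD:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = (SparkSession.builder
             .appName("nested-read-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("my_parquet_table")                         # placeholder Hive table
    exploded = df.select("id", explode("items").alias("item"))   # items: array of structs

    filtered = exploded.filter(col("item.qty") > 1)              # reach into the struct fields
    rdd = filtered.rdd                                           # plain RDD of Rows if needed
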
19 votes, 5 answers

How to split parquet files into many partitions in Spark?

So I have just one parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the parquet to none…
samthebest
  • 30,803
  • 25
  • 102
  • 142
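
A minimal PySpark sketch; the path is a placeholder. Setting spark.default.parallelism alone typically does not change how a single file is split, so the usual approach is an explicit repartition() after the read (or lowering spark.sql.files.maxPartitionBytes so the scan itself produces more splits):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-sketch").getOrCreate()

    df = spark.read.parquet("/data/one_big_file.parquet")   # placeholder path
    df = df.repartition(100)                                 # explicit shuffle into 100 partitions
    print(df.rdd.getNumPartitions())                         # should print 100
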
18 votes, 1 answer

What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?

I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, you only need to do ls on the root folder of the table (given the table is partitioned by only one column), and get all its partitions,…
gdoron
  • 147,333
  • 58
  • 291
  • 367
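
MSCK REPAIR TABLE has to list the table root recursively and reconcile every partition directory it finds with the metastore, which is what makes it slow on large or S3-backed tables. When only a few partitions are new, a common workaround is to register them explicitly; a hedged Spark SQL sketch with database, table, column, and location names as placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("add-partition-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Register one known-new partition instead of rescanning the whole table root.
    spark.sql("""
        ALTER TABLE my_db.events ADD IF NOT EXISTS
        PARTITION (dt='2024-01-01')
        LOCATION 's3://my-bucket/events/dt=2024-01-01/'
    """)
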