Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
25 votes, 4 answers

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load().select(...col1, col2) the best way to do that? I would also prefer to use…
horatio1701d
  • 8,809
  • 14
  • 48
  • 77
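
A minimal PySpark sketch of the column-pruning approach the question above is asking about: selecting only the needed columns right after the read lets Spark push the projection down to the Parquet scan. The path and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("column-pruning").getOrCreate()

    # Selecting immediately after the read pushes the projection down to the
    # Parquet reader, so only those column chunks are fetched from disk.
    df = spark.read.parquet("data.parquet").select("col1", "col2")
    df.explain()  # the physical plan's ReadSchema should list only col1 and col2
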
23 votes, 3 answers

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this: glueContext.write_dynamic_frame.from_options(frame = table, …
Mateo Rod
  • 544
  • 2
  • 6
  • 14
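
A commonly suggested workaround, sketched below: the Glue sink appends by design, so convert the DynamicFrame to a plain Spark DataFrame and write it with overwrite mode instead. Here "table" is the DynamicFrame from the question and the S3 path is a placeholder.

    # Inside a Glue job, where `table` is an existing DynamicFrame:
    (table.toDF()                      # convert to a plain Spark DataFrame
          .write
          .mode("overwrite")           # replace any existing files at the prefix
          .parquet("s3://my-bucket/my-prefix/"))
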
23 votes, 7 answers

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in pandas 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing? The write…
Siraj S.
  • 3,481
  • 3
  • 34
  • 48
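
pandas' to_parquet has no append mode with the pyarrow engine (fastparquet exposes an append=True option if that engine is available), so one workaround is to emulate appending: read the existing file, concatenate, and rewrite. A minimal sketch; append_to_parquet is a hypothetical helper, not a pandas API.

    import os
    import pandas as pd

    def append_to_parquet(df_new: pd.DataFrame, path: str) -> None:
        # Emulates append: read what is already there, concatenate, rewrite.
        if os.path.exists(path):
            df_new = pd.concat([pd.read_parquet(path), df_new], ignore_index=True)
        df_new.to_parquet(path, index=False)
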
23 votes, 4 answers

How to view a Parquet file in IntelliJ

I want to open a Parquet file and view the contents of the table in IntelliJ. Is there a way to do this currently, or with a plugin?
nobody
  • 7,803
  • 11
  • 56
  • 91
23 votes, 2 answers

Why does Apache Spark read unnecessary Parquet columns within nested structures?

My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing…
Peter Stephens
  • 1,040
  • 1
  • 9
  • 23
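
For context, a hedged PySpark sketch: older Spark versions read the entire struct from Parquet even when only one nested field is selected, while newer versions can prune nested columns when the optimizer flag below is enabled (available from roughly Spark 2.4 onward). The path and nested column name are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
             .getOrCreate())

    df = spark.read.parquet("events.parquet")   # placeholder path
    # With pruning enabled, the plan's ReadSchema should contain only payload.user_id
    df.select("payload.user_id").explain()
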
23 votes, 2 answers

How to read a Parquet file in standalone Java code?

The Parquet docs from Cloudera show examples of integration with Pig/Hive/Impala, but in many cases I want to read the Parquet file itself for debugging purposes. Is there a straightforward Java reader API to read a Parquet file? Thanks, Yang
teddy teddy
  • 3,025
  • 6
  • 31
  • 48
22 votes, 4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx
  • 61,411
  • 155
  • 482
  • 830
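
The usual cause is that Arrow cannot infer a type for arbitrary Python objects, so the fix is to flatten each object into plain columns before building the Table. A minimal sketch based on the class from the question:

    import pandas as pd
    import pyarrow as pa

    class Player:
        def __init__(self, name, age, gender):
            self.name, self.age, self.gender = name, age, gender

    players = [Player("alice", 30, "f"), Player("bob", 25, "m")]

    # vars() turns each Player into a dict of plain strings/ints,
    # which Arrow can type without help.
    df = pd.DataFrame([vars(p) for p in players])
    table = pa.Table.from_pandas(df)
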
22 votes, 5 answers

Is it possible to read parquet files in chunks?

For example, pandas' read_csv has a chunksize argument which allows read_csv to return an iterator over the CSV file so we can read it in chunks. The Parquet format stores the data in chunks, but there isn't a documented way to read it in chunks…
xiaodai
  • 14,889
  • 18
  • 76
  • 140
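
One option, sketched below: pyarrow's ParquetFile.iter_batches streams the file as record batches of roughly batch_size rows, so the whole file never has to sit in memory at once (this needs a reasonably recent pyarrow; the path and batch size are placeholders).

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("large.parquet")
    for batch in pf.iter_batches(batch_size=64_000):
        chunk = batch.to_pandas()   # process one chunk at a time
        print(len(chunk))
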
22 votes, 2 answers

How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot…
golobor
  • 1,208
  • 11
  • 10
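
A minimal pyarrow sketch: Parquet's file-level key/value metadata lives on the schema, so merge your own entries into the existing schema metadata before writing (keys and values must be bytes; the keys used here are just illustrative).

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"gene": ["a", "b"], "count": [1, 2]})

    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata(
        {**existing, b"sample": b"S01", b"pipeline": b"v1.2"})
    pq.write_table(table, "annotated.parquet")

    # Read it back to confirm:
    print(pq.read_schema("annotated.parquet").metadata)
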
22 votes, 7 answers

GUI tools for viewing/editing Apache Parquet

I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view it in the terminal, but I would like some GUI tool to view Parquet files in a more user-friendly format. Does such a program exist?
Roman Zavodskikh
  • 513
  • 1
  • 6
  • 14
22 votes, 4 answers

Using Spark to write a parquet file to s3 over s3a is very slow

I'm trying to write a parquet file out to Amazon S3 using Spark 1.6.1. The small parquet that I'm generating is ~2GB once written so it's not that much data. I'm trying to prove Spark out as a platform that I can use. Basically what I'm doing is…
Brutus35
  • 573
  • 2
  • 6
  • 12
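
A frequently cited culprit is the rename-based output committer: on S3 a rename is really a copy, so committing the output can dominate the write time. Below is a hedged sketch of the usual mitigations (the v2 commit algorithm, or the dedicated S3A committers on newer Hadoop builds); treat these settings as a starting point rather than a guarantee, and the bucket is a placeholder.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .config("spark.speculation", "false")   # speculative tasks make S3 commits worse
             .getOrCreate())

    df = spark.range(1000)
    df.write.mode("overwrite").parquet("s3a://my-bucket/out/")
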
22 votes, 4 answers

Updating values in apache parquet file

I have a quite hefty parquet file where I need to change values for one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering if there is a less expensive and overall easier…
marcin_koss
  • 5,763
  • 10
  • 46
  • 65
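
Parquet files are immutable, so the usual answer is read, transform, write a new copy, and swap it in. A minimal PySpark sketch; the paths, column, and condition are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("input.parquet")
    # Example transformation: clamp negative prices to zero.
    df = df.withColumn("price", F.when(F.col("price") < 0, 0).otherwise(F.col("price")))
    df.write.mode("overwrite").parquet("input_fixed.parquet")
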
21 votes, 4 answers

PySpark: org.apache.spark.sql.AnalysisException: Attribute name ... contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it

I'm trying to load Parquet data into PySpark, where a column has a space in the name: df = spark.read.parquet('my_parquet_dump') df.select(df['Foo Bar'].alias('foobar')) Even though I have aliased the column, I'm still getting this error and error…
munro
  • 341
  • 1
  • 3
  • 6
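
If Spark refuses to touch the file because of the space, one workaround is to rewrite it once with pyarrow, renaming the offending columns, and then read the clean copy from Spark. The paths and the specific rename below are illustrative.

    import pyarrow.parquet as pq

    table = pq.read_table("my_parquet_dump")
    table = table.rename_columns([c.replace(" ", "_") for c in table.column_names])
    pq.write_table(table, "my_parquet_dump_clean")
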
21 votes, 2 answers

JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found, to convert the messages to Parquet, either Hive, Pig, or Spark are…
vijju
  • 415
  • 1
  • 5
  • 9
21 votes, 7 answers

Spark Dataframe validating column names for parquet writes

I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format. However, some of the JSON events contain spaces in the keys, which I want to log, and filter/drop such events from the…
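
One hedged approach to the validation part: check incoming column names against the character set from Spark's error message and log and drop (or rename) the offenders before the Parquet write. invalid_columns is a hypothetical helper, and df is assumed to be the DataFrame built from the JSON events.

    import re

    INVALID = re.compile(r"[ ,;{}()\n\t=]")

    def invalid_columns(df):
        # Hypothetical helper: column names Parquet-backed writes would reject.
        return [c for c in df.columns if INVALID.search(c)]

    bad = invalid_columns(df)
    if bad:
        print(f"Dropping columns with invalid names: {bad}")   # or send to a logger
        df = df.drop(*bad)
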