Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
25 votes, 4 answers

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in spark from a parquet file that has many columns? Is using spark.read.format("parquet").load().select(...col1, col2) the best way to do that? I would also prefer to use…
horatio1701d
  • 8,809
  • 14
  • 48
  • 77
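
A minimal PySpark sketch of the column-pruning approach the question above is asking about: selecting only the needed columns right after the read lets Spark push the projection down to the Parquet scan. The path and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("column-pruning").getOrCreate()

    # Selecting immediately after the read pushes the projection down to the
    # Parquet reader, so only those column chunks are fetched from disk.
    df = spark.read.parquet("data.parquet").select("col1", "col2")
    df.explain()  # the physical plan's ReadSchema should list only col1 and col2
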
23 votes, 3 answers

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this: glueContext.write_dynamic_frame.from_options(frame = table, …
Mateo Rod
  • 544
  • 2
  • 6
  • 14
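
A commonly suggested workaround, sketched below: the Glue sink appends by design, so convert the DynamicFrame to a plain Spark DataFrame and write it with overwrite mode instead. Here "table" is the DynamicFrame from the question and the S3 path is a placeholder.

    # Inside a Glue job, where `table` is an existing DynamicFrame:
    (table.toDF()                      # convert to a plain Spark DataFrame
          .write
          .mode("overwrite")           # replace any existing files at the prefix
          .parquet("s3://my-bucket/my-prefix/"))
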
23 votes, 7 answers

pandas write dataframe to parquet format with append

I am trying to write a pandas dataframe to the parquet file format (introduced in pandas 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with new data. What am I missing? The write…
Siraj S.
  • 3,481
  • 3
  • 34
  • 48
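
pandas' to_parquet has no append mode with the pyarrow engine (fastparquet exposes an append=True option if that engine is available), so one workaround is to emulate appending: read the existing file, concatenate, and rewrite. A minimal sketch; append_to_parquet is a hypothetical helper, not a pandas API.

    import os
    import pandas as pd

    def append_to_parquet(df_new: pd.DataFrame, path: str) -> None:
        # Emulates append: read what is already there, concatenate, rewrite.
        if os.path.exists(path):
            df_new = pd.concat([pd.read_parquet(path), df_new], ignore_index=True)
        df_new.to_parquet(path, index=False)
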
23 votes, 4 answers

How to view a Parquet file in IntelliJ

I want to open a Parquet file and view the contents of the table in IntelliJ. Is there a way to do this currently, or with a plugin?
nobody
  • 7,803
  • 11
  • 56
  • 91
23 votes, 2 answers

Why does Apache Spark read unnecessary Parquet columns within nested structures?

My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing…
Peter Stephens
  • 1,040
  • 1
  • 9
  • 23
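
For context, a hedged PySpark sketch: older Spark versions read the entire struct from Parquet even when only one nested field is selected, while newer versions can prune nested columns when the optimizer flag below is enabled (available from roughly Spark 2.4 onward). The path and nested column name are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
             .getOrCreate())

    df = spark.read.parquet("events.parquet")   # placeholder path
    # With pruning enabled, the plan's ReadSchema should contain only payload.user_id
    df.select("payload.user_id").explain()
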
23 votes, 2 answers

How to read a Parquet file in standalone Java code?

The Parquet docs from Cloudera show examples of integration with Pig/Hive/Impala, but in many cases I want to read the Parquet file itself for debugging purposes. Is there a straightforward Java reader API to read a Parquet file? Thanks, Yang
teddy teddy
  • 3,025
  • 6
  • 31
  • 48
22 votes, 4 answers

pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code import pandas as pd import pyarrow as pa class Player: def __init__(self, name, age, gender): self.name = name …
Nyxynyx
  • 61,411
  • 155
  • 482
  • 830
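
The usual cause is that Arrow cannot infer a type for arbitrary Python objects, so the fix is to flatten each object into plain columns before building the Table. A minimal sketch based on the class from the question:

    import pandas as pd
    import pyarrow as pa

    class Player:
        def __init__(self, name, age, gender):
            self.name, self.age, self.gender = name, age, gender

    players = [Player("alice", 30, "f"), Player("bob", 25, "m")]

    # vars() turns each Player into a dict of plain strings/ints,
    # which Arrow can type without help.
    df = pd.DataFrame([vars(p) for p in players])
    table = pa.Table.from_pandas(df)
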
22 votes, 5 answers

Is it possible to read parquet files in chunks?

For example, pandas' read_csv has a chunksize argument which allows read_csv to return an iterator over the CSV file so we can read it in chunks. The Parquet format stores the data in chunks, but there isn't a documented way to read it in chunks…
xiaodai
  • 14,889
  • 18
  • 76
  • 140
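
One option, sketched below: pyarrow's ParquetFile.iter_batches streams the file as record batches of roughly batch_size rows, so the whole file never has to sit in memory at once (this needs a reasonably recent pyarrow; the path and batch size are placeholders).

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("large.parquet")
    for batch in pf.iter_batches(batch_size=64_000):
        chunk = batch.to_pandas()   # process one chunk at a time
        print(len(chunk))
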
22 votes, 2 answers

How to write Parquet metadata with pyarrow?

I use pyarrow to create and analyse Parquet tables with biological information and I need to store some metadata, e.g. which sample the data comes from, how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot…
golobor
  • 1,208
  • 11
  • 10
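
A minimal pyarrow sketch: Parquet's file-level key/value metadata lives on the schema, so merge your own entries into the existing schema metadata before writing (keys and values must be bytes; the keys used here are just illustrative).

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"gene": ["a", "b"], "count": [1, 2]})

    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata(
        {**existing, b"sample": b"S01", b"pipeline": b"v1.2"})
    pq.write_table(table, "annotated.parquet")

    # Read it back to confirm:
    print(pq.read_schema("annotated.parquet").metadata)
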
22 votes, 7 answers

GUI tools for viewing/editing Apache Parquet

I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view it in the terminal, but I would like some GUI tool to view Parquet files in a more user-friendly format. Does such a program exist?
Roman Zavodskikh
  • 513
  • 1
  • 6
  • 14
22 votes, 4 answers

Using Spark to write a parquet file to s3 over s3a is very slow

I'm trying to write a parquet file out to Amazon S3 using Spark 1.6.1. The small parquet that I'm generating is ~2GB once written so it's not that much data. I'm trying to prove Spark out as a platform that I can use. Basically what I'm doing is…
Brutus35
  • 573
  • 2
  • 6
  • 12
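
A frequently cited culprit is the rename-based output committer: on S3 a rename is really a copy, so committing the output can dominate the write time. Below is a hedged sketch of the usual mitigations (the v2 commit algorithm, or the dedicated S3A committers on newer Hadoop builds); treat these settings as a starting point rather than a guarantee, and the bucket is a placeholder.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .config("spark.speculation", "false")   # speculative tasks make S3 commits worse
             .getOrCreate())

    df = spark.range(1000)
    df.write.mode("overwrite").parquet("s3a://my-bucket/out/")
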
22 votes, 4 answers

Updating values in apache parquet file

I have a quite hefty parquet file where I need to change values for one of the columns. One way to do this would be to update those values in the source text files and recreate the parquet file, but I'm wondering if there is a less expensive and overall easier…
marcin_koss
  • 5,763
  • 10
  • 46
  • 65
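
Parquet files are immutable, so the usual answer is read, transform, write a new copy, and swap it in. A minimal PySpark sketch; the paths, column, and condition are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("input.parquet")
    # Example transformation: clamp negative prices to zero.
    df = df.withColumn("price", F.when(F.col("price") < 0, 0).otherwise(F.col("price")))
    df.write.mode("overwrite").parquet("input_fixed.parquet")
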
21 votes, 4 answers

PySpark: org.apache.spark.sql.AnalysisException: Attribute name ... contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it

I'm trying to load Parquet data into PySpark, where a column has a space in the name: df = spark.read.parquet('my_parquet_dump') df.select(df['Foo Bar'].alias('foobar')) Even though I have aliased the column, I'm still getting this error and error…
munro
  • 341
  • 1
  • 3
  • 6
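
If Spark refuses to touch the file because of the space, one workaround is to rewrite it once with pyarrow, renaming the offending columns, and then read the clean copy from Spark. The paths and the specific rename below are illustrative.

    import pyarrow.parquet as pq

    table = pq.read_table("my_parquet_dump")
    table = table.rename_columns([c.replace(" ", "_") for c in table.column_names])
    pq.write_table(table, "my_parquet_dump_clean")
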
21 votes, 2 answers

JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. From what I have found, to convert the messages to Parquet, either Hive, Pig, or Spark are…
vijju
  • 415
  • 1
  • 5
  • 9
21 votes, 7 answers

Spark Dataframe validating column names for parquet writes

I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format. However, some of the JSON events contain spaces in the keys, which I want to log, and filter/drop such events from the…
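
One hedged approach to the validation part: check incoming column names against the character set from Spark's error message and log and drop (or rename) the offenders before the Parquet write. invalid_columns is a hypothetical helper, and df is assumed to be the DataFrame built from the JSON events.

    import re

    INVALID = re.compile(r"[ ,;{}()\n\t=]")

    def invalid_columns(df):
        # Hypothetical helper: column names Parquet-backed writes would reject.
        return [c for c in df.columns if INVALID.search(c)]

    bad = invalid_columns(df)
    if bad:
        print(f"Dropping columns with invalid names: {bad}")   # or send to a logger
        df = df.drop(*bad)
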