Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
18
votes
1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot
  • 1,275
  • 1
  • 10
  • 19
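
One way to pin column types regardless of whether the values are all null is to build an explicit Arrow schema and convert the frame through pyarrow instead of relying on type inference. A minimal sketch of that idea (column names, types, and the file name below are made up for illustration):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical frame: "maybe_text" is entirely null, so inference alone would store it as a null type.
    df = pd.DataFrame({"id": [1, 2, 3], "maybe_text": [None, None, None]})

    # Pin the column types explicitly instead of letting pyarrow infer them.
    schema = pa.schema([("id", pa.int64()), ("maybe_text", pa.string())])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "typed.parquet")

    print(pq.read_schema("typed.parquet"))  # maybe_text is stored as string, not null
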
18
votes
3 answers

Parquet Writer to buffer or byte stream

I have a Java application that converts JSON messages to Parquet format. Is there any Parquet writer that writes to a buffer or byte stream in Java? Most of the examples I have seen write to files.
vijju
  • 415
  • 1
  • 5
  • 9
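
The question above asks about Java, where a ParquetWriter normally wants a file or an OutputFile; as a hedged illustration of the same idea in Python, pyarrow can write a Parquet file into an in-memory buffer instead of a path:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # Write to an in-memory sink instead of a file path.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)

    parquet_bytes = sink.getvalue().to_pybytes()  # the complete Parquet file as bytes
    print(len(parquet_bytes))
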
18
votes
6 answers

How to suppress parquet log messages in Spark?

How do I stop messages like these from appearing on my spark-shell console? 5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block 5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:…
user568109
  • 47,225
  • 17
  • 99
  • 123
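
One blunt, commonly used knob is raising the JVM-side log level from the driver; a minimal PySpark sketch with a hypothetical file name (very old parquet-mr releases logged through java.util.logging rather than log4j, so this may not silence everything):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quiet-parquet").getOrCreate()

    # Raise the JVM-side log level; INFO chatter from the Parquet readers goes away.
    # Note: very old parquet-mr versions used java.util.logging and may ignore this.
    spark.sparkContext.setLogLevel("WARN")

    spark.read.parquet("some.parquet").count()  # hypothetical file, just to trigger a read
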
18
votes
3 answers

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data. Is there any way to do this without first loading the data into Hive, etc and then using one of…
danieltahara
  • 4,743
  • 3
  • 18
  • 20
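
Assuming the input is newline-delimited JSON (one object per line), pandas plus pyarrow can do the conversion directly, without going through Hive; the file names below are hypothetical:

    import pandas as pd

    # Read newline-delimited JSON objects into a DataFrame, then write it out as Parquet.
    df = pd.read_json("records.jsonl", lines=True)
    df.to_parquet("records.parquet", engine="pyarrow", index=False)
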
17
votes
2 answers

How do I save multi-indexed pandas dataframes to parquet?

How do I save the dataframe shown at the end to parquet? It was constructed this way: df_test = pd.DataFrame(np.random.rand(6,4)) df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'), ('c1', 'c2', 'c3', 'c4')], names=['lev_0',…
techvslife
  • 2,273
  • 2
  • 20
  • 26
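
Parquet requires plain string column names, so one common workaround is to flatten the column MultiIndex before writing and rebuild it after reading back; a sketch using the frame from the question (the separator and output file name are arbitrary choices):

    import numpy as np
    import pandas as pd

    df_test = pd.DataFrame(np.random.rand(6, 4))
    df_test.columns = pd.MultiIndex.from_arrays(
        [("A", "A", "B", "B"), ("c1", "c2", "c3", "c4")], names=["lev_0", "lev_1"]
    )

    # Flatten the column MultiIndex into plain strings so Parquet accepts it.
    flat = df_test.copy()
    flat.columns = ["__".join(levels) for levels in flat.columns]
    flat.to_parquet("multiindex.parquet")

    # Reading back: split the names to restore the MultiIndex.
    restored = pd.read_parquet("multiindex.parquet")
    restored.columns = pd.MultiIndex.from_tuples(
        [tuple(name.split("__")) for name in restored.columns], names=["lev_0", "lev_1"]
    )
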
17
votes
1 answer

Read parquet data from AWS s3 bucket

I need to read Parquet data from AWS S3. If I use the AWS SDK for this I can get an InputStream like this: S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey)); InputStream inputStream = object.getObjectContent(); But the apache…
Alexander
  • 391
  • 1
  • 4
  • 12
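
The question above is about the Java AWS SDK, where the SDK hands back an InputStream but the Parquet reader wants a seekable file. As a hedged aside, in Python the same task is a one-liner via pandas and s3fs (the bucket and key are hypothetical):

    import pandas as pd

    # Requires the s3fs package; credentials come from the usual AWS environment/config.
    df = pd.read_parquet("s3://my-bucket/path/to/data.parquet")
    print(df.head())
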
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a CSV (tab-separated) file, and I successfully read them on a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
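
A minimal dask sketch of the usual pattern, reading the source lazily in blocks and writing a partitioned Parquet dataset, with hypothetical paths (100,000 columns is far beyond what Parquet handles comfortably, so a layout that wide may need rethinking regardless):

    import dask.dataframe as dd

    # Read the tab-separated file lazily in manageable blocks.
    ddf = dd.read_csv("huge.tsv", sep="\t", blocksize="256MB")

    # Write a directory of Parquet part files; nothing is held fully in memory.
    ddf.to_parquet("huge_parquet/", engine="pyarrow", write_index=False)
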
17
votes
3 answers

SPARK DataFrame: How to efficiently split dataframe for each group based on same column values

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value").alias("TotalValue")) .sort($"Hour".asc, $"TotalValue".desc) The results look…
shubham rajput
  • 1,015
  • 1
  • 9
  • 12
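
One common answer is not to split the DataFrame in the driver at all but to let the writer do it with partitionBy, so each group value becomes its own directory. A hedged PySpark sketch of the aggregation from the question, with a hypothetical input path:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("events.parquet")  # hypothetical input with Hour, Category, value columns

    agg = df.groupBy("Hour", "Category").agg(F.sum("value").alias("TotalValue"))

    # Each distinct Hour value lands in its own Hour=<value>/ subdirectory.
    agg.write.mode("overwrite").partitionBy("Hour").parquet("out/by_hour")
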
17
votes
1 answer

Spark SQL: Why two jobs for one query?

Experiment: I tried the following snippet on Spark 1.6.1. val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files soDF.registerTempTable("so") sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by…
Mohitt
  • 2,957
  • 3
  • 29
  • 52
16
votes
5 answers

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since I'm using Athena, I'd like to convert the CSV files…
mark s.
  • 656
  • 2
  • 7
  • 14
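
Glue jobs run Spark under the hood, so the core of the conversion can be plain PySpark even before reaching for the awsglue DynamicFrame API; a hedged sketch with hypothetical S3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the raw CSVs (headers assumed), then write them back out as Parquet for Athena.
    df = spark.read.option("header", "true").csv("s3://my-bucket/raw-csv/")
    df.write.mode("overwrite").parquet("s3://my-bucket/curated-parquet/")
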
16
votes
7 answers

Get schema of parquet file in Python

Is there any Python library that can be used to get just the schema of a Parquet file? Currently we are loading the Parquet file into a dataframe in Spark and getting the schema from that dataframe to display in some UI of the application. But initializing…
Saran
  • 835
  • 3
  • 11
  • 31
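
pyarrow can read just the footer metadata without spinning up Spark or loading any row data; a minimal sketch with a hypothetical file name:

    import pyarrow.parquet as pq

    # Reads only the file footer; no row groups are decoded.
    schema = pq.read_schema("example.parquet")
    print(schema)

    # Row-group and column statistics are available from the same footer if needed.
    print(pq.ParquetFile("example.parquet").metadata)
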
16
votes
2 answers

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151). I tried this in spark-shell: sqlContext.read.load("x.parquet").count And Spark ran two stages, showing various…
Daniel Darabos
  • 26,991
  • 10
  • 102
  • 114
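
The per-block row count the question mentions lives in the file footer, so it can be read without scanning any column data; a hedged pyarrow sketch (the question itself is about how Spark makes use of that field):

    import pyarrow.parquet as pq

    # num_rows is taken from the row-group metadata in the footer; no data pages are read.
    meta = pq.ParquetFile("x.parquet").metadata
    print(meta.num_rows, "rows in", meta.num_row_groups, "row group(s)")
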
16
votes
2 answers

Apache Drill has bad performance against SQL Server

I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. My test query was: SELECT p.Product_Category, SUM(f.sales) FROM facts f JOIN Product p on f.pkey = p.pkey GROUP BY p.Product_Category Where facts has…
Imbar M.
  • 1,074
  • 1
  • 10
  • 19
15
votes
5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

I'm using Python, Parquet, and Spark, and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
Russell Burdt
  • 2,391
  • 2
  • 19
  • 30
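
A quick way to confirm whether the installed pyarrow build actually has snappy support (and therefore whether reinstalling a prebuilt binary wheel would help) is to query the codec availability; a small sketch, assuming a pyarrow version recent enough to expose Codec.is_available:

    import pyarrow as pa

    print(pa.__version__)

    # Returns False when the build lacks snappy support, matching the error in the question.
    print(pa.Codec.is_available("snappy"))
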
15
votes
5 answers

Read local Parquet file without Hadoop Path API

I'm trying to read a local Parquet file; however, the only APIs I can find are tightly coupled with Hadoop and require a Hadoop Path as input (even for pointing to a local file). ParquetReader reader =…
Ben Watson
  • 5,357
  • 4
  • 42
  • 65
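
The question above is about the Java parquet-mr API, where the reader historically required a Hadoop Path. As a hedged aside, in Python pyarrow has no Hadoop dependency at all for local files (file name below is hypothetical):

    import pyarrow.parquet as pq

    # Plain local path; no Hadoop classes or configuration involved.
    table = pq.read_table("local_file.parquet")
    df = table.to_pandas()
    print(df.shape)
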