Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
18
votes
1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot
  • 1,275
  • 1
  • 10
  • 19
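
One way to pin column types regardless of whether the values are all null is to build an explicit Arrow schema and convert the frame through pyarrow instead of relying on type inference. A minimal sketch of that idea (column names, types, and the file name below are made up for illustration):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical frame: "maybe_text" is entirely null, so inference alone would store it as a null type.
    df = pd.DataFrame({"id": [1, 2, 3], "maybe_text": [None, None, None]})

    # Pin the column types explicitly instead of letting pyarrow infer them.
    schema = pa.schema([("id", pa.int64()), ("maybe_text", pa.string())])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "typed.parquet")

    print(pq.read_schema("typed.parquet"))  # maybe_text is stored as string, not null
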
18
votes
3 answers

Parquet Writer to buffer or byte stream

I have a Java application that converts JSON messages to Parquet format. Is there any Parquet writer that writes to a buffer or byte stream in Java? Most of the examples I have seen write to files.
vijju
  • 415
  • 1
  • 5
  • 9
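
The question above asks about Java, where a ParquetWriter normally wants a file or an OutputFile; as a hedged illustration of the same idea in Python, pyarrow can write a Parquet file into an in-memory buffer instead of a path:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # Write to an in-memory sink instead of a file path.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)

    parquet_bytes = sink.getvalue().to_pybytes()  # the complete Parquet file as bytes
    print(len(parquet_bytes))
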
18
votes
6 answers

How to suppress parquet log messages in Spark?

How do I stop messages like these from appearing on my spark-shell console? 5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block 5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:…
user568109
  • 47,225
  • 17
  • 99
  • 123
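
One blunt, commonly used knob is raising the JVM-side log level from the driver; a minimal PySpark sketch with a hypothetical file name (very old parquet-mr releases logged through java.util.logging rather than log4j, so this may not silence everything):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quiet-parquet").getOrCreate()

    # Raise the JVM-side log level; INFO chatter from the Parquet readers goes away.
    # Note: very old parquet-mr versions used java.util.logging and may ignore this.
    spark.sparkContext.setLogLevel("WARN")

    spark.read.parquet("some.parquet").count()  # hypothetical file, just to trigger a read
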
18
votes
3 answers

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data. Is there any way to do this without first loading the data into Hive, etc and then using one of…
danieltahara
  • 4,743
  • 3
  • 18
  • 20
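
Assuming the input is newline-delimited JSON (one object per line), pandas plus pyarrow can do the conversion directly, without going through Hive; the file names below are hypothetical:

    import pandas as pd

    # Read newline-delimited JSON objects into a DataFrame, then write it out as Parquet.
    df = pd.read_json("records.jsonl", lines=True)
    df.to_parquet("records.parquet", engine="pyarrow", index=False)
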
17
votes
2 answers

How do I save multi-indexed pandas dataframes to parquet?

How do I save the dataframe shown at the end to parquet? It was constructed this way: df_test = pd.DataFrame(np.random.rand(6,4)) df_test.columns = pd.MultiIndex.from_arrays([('A', 'A', 'B', 'B'), ('c1', 'c2', 'c3', 'c4')], names=['lev_0',…
techvslife
  • 2,273
  • 2
  • 20
  • 26
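
Parquet requires plain string column names, so one common workaround is to flatten the column MultiIndex before writing and rebuild it after reading back; a sketch using the frame from the question (the separator and output file name are arbitrary choices):

    import numpy as np
    import pandas as pd

    df_test = pd.DataFrame(np.random.rand(6, 4))
    df_test.columns = pd.MultiIndex.from_arrays(
        [("A", "A", "B", "B"), ("c1", "c2", "c3", "c4")], names=["lev_0", "lev_1"]
    )

    # Flatten the column MultiIndex into plain strings so Parquet accepts it.
    flat = df_test.copy()
    flat.columns = ["__".join(levels) for levels in flat.columns]
    flat.to_parquet("multiindex.parquet")

    # Reading back: split the names to restore the MultiIndex.
    restored = pd.read_parquet("multiindex.parquet")
    restored.columns = pd.MultiIndex.from_tuples(
        [tuple(name.split("__")) for name in restored.columns], names=["lev_0", "lev_1"]
    )
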
17
votes
1 answer

Read parquet data from AWS s3 bucket

I need to read Parquet data from AWS S3. If I use the AWS SDK for this I can get an InputStream like this: S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey)); InputStream inputStream = object.getObjectContent(); But the apache…
Alexander
  • 391
  • 1
  • 4
  • 12
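
The question above is about the Java AWS SDK, where the SDK hands back an InputStream but the Parquet reader wants a seekable file. As a hedged aside, in Python the same task is a one-liner via pandas and s3fs (the bucket and key are hypothetical):

    import pandas as pd

    # Requires the s3fs package; credentials come from the usual AWS environment/config.
    df = pd.read_parquet("s3://my-bucket/path/to/data.parquet")
    print(df.head())
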
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a CSV (tab-separated) file, and I successfully read them on a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
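
A minimal dask sketch of the usual pattern, reading the source lazily in blocks and writing a partitioned Parquet dataset, with hypothetical paths (100,000 columns is far beyond what Parquet handles comfortably, so a layout that wide may need rethinking regardless):

    import dask.dataframe as dd

    # Read the tab-separated file lazily in manageable blocks.
    ddf = dd.read_csv("huge.tsv", sep="\t", blocksize="256MB")

    # Write a directory of Parquet part files; nothing is held fully in memory.
    ddf.to_parquet("huge_parquet/", engine="pyarrow", write_index=False)
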
17
votes
3 answers

SPARK DataFrame: How to efficiently split dataframe for each group based on same column values

I have a DataFrame generated as follows: df.groupBy($"Hour", $"Category") .agg(sum($"value").alias("TotalValue")) .sort($"Hour".asc, $"TotalValue".desc) The results look…
shubham rajput
  • 1,015
  • 1
  • 9
  • 12
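
One common answer is not to split the DataFrame in the driver at all but to let the writer do it with partitionBy, so each group value becomes its own directory. A hedged PySpark sketch of the aggregation from the question, with a hypothetical input path:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("events.parquet")  # hypothetical input with Hour, Category, value columns

    agg = df.groupBy("Hour", "Category").agg(F.sum("value").alias("TotalValue"))

    # Each distinct Hour value lands in its own Hour=<value>/ subdirectory.
    agg.write.mode("overwrite").partitionBy("Hour").parquet("out/by_hour")
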
17
votes
1 answer

Spark SQL: Why two jobs for one query?

Experiment: I tried the following snippet on Spark 1.6.1. val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") // This has 45 files soDF.registerTempTable("so") sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by…
Mohitt
  • 2,957
  • 3
  • 29
  • 52
16
votes
5 answers

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since I'm using Athena, I'd like to convert the CSV files…
mark s.
  • 656
  • 2
  • 7
  • 14
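
Glue jobs run Spark under the hood, so the core of the conversion can be plain PySpark even before reaching for the awsglue DynamicFrame API; a hedged sketch with hypothetical S3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the raw CSVs (headers assumed), then write them back out as Parquet for Athena.
    df = spark.read.option("header", "true").csv("s3://my-bucket/raw-csv/")
    df.write.mode("overwrite").parquet("s3://my-bucket/curated-parquet/")
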
16
votes
7 answers

Get schema of parquet file in Python

Is there any Python library that can be used to get just the schema of a Parquet file? Currently we are loading the Parquet file into a dataframe in Spark and getting the schema from that dataframe to display in some UI of the application. But initializing…
Saran
  • 835
  • 3
  • 11
  • 31
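
pyarrow can read just the footer metadata without spinning up Spark or loading any row data; a minimal sketch with a hypothetical file name:

    import pyarrow.parquet as pq

    # Reads only the file footer; no row groups are decoded.
    schema = pq.read_schema("example.parquet")
    print(schema)

    # Row-group and column statistics are available from the same footer if needed.
    print(pq.ParquetFile("example.parquet").metadata)
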
16
votes
2 answers

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151). I tried this in spark-shell: sqlContext.read.load("x.parquet").count And Spark ran two stages, showing various…
Daniel Darabos
  • 26,991
  • 10
  • 102
  • 114
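
The per-block row count the question mentions lives in the file footer, so it can be read without scanning any column data; a hedged pyarrow sketch (the question itself is about how Spark makes use of that field):

    import pyarrow.parquet as pq

    # num_rows is taken from the row-group metadata in the footer; no data pages are read.
    meta = pq.ParquetFile("x.parquet").metadata
    print(meta.num_rows, "rows in", meta.num_row_groups, "row group(s)")
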
16
votes
2 answers

Apache Drill has bad performance against SQL Server

I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. My test query was: SELECT p.Product_Category, SUM(f.sales) FROM facts f JOIN Product p on f.pkey = p.pkey GROUP BY p.Product_Category Where facts has…
Imbar M.
  • 1,074
  • 1
  • 10
  • 19
15
votes
5 answers

Python error using pyarrow - ArrowNotImplementedError: Support for codec 'snappy' not built

I'm using Python, Parquet, and Spark, and running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version without this error was pyarrow=0.17. The error does not appear in pyarrow=1.0.1…
Russell Burdt
  • 2,391
  • 2
  • 19
  • 30
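
A quick way to confirm whether the installed pyarrow build actually has snappy support (and therefore whether reinstalling a prebuilt binary wheel would help) is to query the codec availability; a small sketch, assuming a pyarrow version recent enough to expose Codec.is_available:

    import pyarrow as pa

    print(pa.__version__)

    # Returns False when the build lacks snappy support, matching the error in the question.
    print(pa.Codec.is_available("snappy"))
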
15
votes
5 answers

Read local Parquet file without Hadoop Path API

I'm trying to read a local Parquet file; however, the only APIs I can find are tightly coupled with Hadoop and require a Hadoop Path as input (even for pointing to a local file). ParquetReader reader =…
Ben Watson
  • 5,357
  • 4
  • 42
  • 65
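
The question above is about the Java parquet-mr API, where the reader historically required a Hadoop Path. As a hedged aside, in Python pyarrow has no Hadoop dependency at all for local files (file name below is hypothetical):

    import pyarrow.parquet as pq

    # Plain local path; no Hadoop classes or configuration involved.
    table = pq.read_table("local_file.parquet")
    df = table.to_pandas()
    print(df.shape)
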