Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
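For a concrete sense of the format, here is a minimal sketch of writing and reading a Parquet file from Python with pandas; the file and column names are arbitrary:

    import pandas as pd

    # Write a small frame to a compressed, columnar file on disk.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    df.to_parquet("example.parquet", engine="pyarrow")

    # Columnar layout means a reader can pull back just the columns it needs.
    roundtrip = pd.read_parquet("example.parquet", columns=["value"])
    print(roundtrip)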

3891 questions
1 vote • 1 answer

How to read a parquet file in Azure Databricks?

I have a few parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with an "incorrect syntax" error. Can someone suggest the correct way to read parquet files using Azure…
ZZZSharePoint • 1,163 • 1 • 19 • 54
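A hedged sketch of one common approach, assuming the files live in ADLS Gen2 and the cluster is already authorized against the storage account; the container, account, and path names are hypothetical placeholders:

    # `spark` is the session Databricks provides in every notebook.
    df = spark.read.parquet(
        "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/"
    )
    df.show(5)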
1 vote • 1 answer

Databricks - Autoloader - Not Terminating?

I'm new to Databricks, and I have several Azure Blob .parquet locations I'm pulling data from and want to put through Auto Loader so I can "create table ... using delta location ''" in SQL in another step. (Each parquet file is in its own…
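Auto Loader runs as a streaming query, so by default it never terminates; one approach is a finite trigger that processes the files currently present and then stops. A sketch, assuming Databricks Runtime 10.1+ for availableNow (older runtimes would use trigger(once=True)); all paths here are hypothetical:

    (spark.readStream
        .format("cloudFiles")                        # Auto Loader source
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "/tmp/schema")
        .load("wasbs://container@account.blob.core.windows.net/input/")
        .writeStream
        .format("delta")                             # target for the later CREATE TABLE ... USING DELTA
        .option("checkpointLocation", "/tmp/checkpoint")
        .trigger(availableNow=True)                  # drain what's there, then stop
        .start("/mnt/delta/target")
        .awaitTermination())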
1 vote • 2 answers

Dask writing into multiple parquet files by key

I have a very large dataset on disk as a CSV file. I would like to load this into Dask, do some cleaning, and then save the data for each value of date into a separate file/folder, as follows: . └── test └── 20211201 └── part.0.parquet …
Nezo • 567 • 4 • 18
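A sketch of one way to do this with Dask's partition_on, which writes one sub-directory per distinct key; note that it produces hive-style names (test/date=20211201/part.0.parquet) rather than the bare 20211201 shown above. The file and column names are assumptions:

    import dask.dataframe as dd

    ddf = dd.read_csv("large.csv")
    # ... cleaning steps ...
    ddf.to_parquet("test/", partition_on=["date"])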
1 vote • 1 answer

Is there an efficient way of changing a feather file to a parquet file?

I have a big Feather file, which I want to change to Parquet so that I can work with PySpark. Is there a more efficient way of changing the file type than doing the following: df = pd.read_feather('file.feather').set_index('date') df_parquet =…
TiTo • 833 • 2 • 7 • 28
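Since Feather files are already Arrow tables, one plausibly cheaper route converts at the Arrow level and skips the pandas round-trip entirely (the set_index step is dropped, which is usually harmless because Spark ignores the pandas index anyway). A sketch with assumed file names:

    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    table = feather.read_table("file.feather")  # reads straight into an Arrow table
    pq.write_table(table, "file.parquet")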
1 vote • 0 answers

Tried reading a parquet file in Spring Boot without using Spark. It works on the local machine but fails when deployed in an AWS ECS container

Getting this error: Can not read value at 0 in block -1 in file file:localdirectory/samplefile.parquet. I have to read a directory containing parquet files from an S3 bucket. For this, I am downloading the directory from S3 locally and reading it in…
mold_9580 • 11 • 2
1 vote • 0 answers

Redshift COPY error: "Assert code: 1000 context: Reached unreachable code - Invalid type: 6551 query"

We are trying to copy data from S3 (parquet files) to Redshift. Here are the respective details. Athena DDL: CREATE EXTERNAL TABLE tablename( `id` int, `col1` int, `col2` date, `col3` string, `col4` decimal(10,2), binarycol binary); Redshift DDL: CREATE…
1 vote • 1 answer

Dir columns returned by default when querying parquet files in Apache Drill 1.20

In the latest version of Drill, the dir columns are returned by default when running a 'select *' on a parquet file. Is there a way we can disable them? The query: 'Select * from dfs.`C:\Sample.parquet` where EmpID <>'null'' The result for the above…
Rik • 81 • 1 • 15
1 vote • 1 answer

Streaming parquet files from S3 (Python)

I should begin by saying that this is not running in Spark. What I am attempting to do is: stream n records from a parquet file in S3, process, stream back to a different file in S3 ... but am only inquiring about the first step. Have tried various…
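For that first step, a hedged sketch using pyarrow's batch iterator over an S3 object, so the whole file never has to fit in memory; the bucket/key and the process() callback are hypothetical:

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    with fs.open("my-bucket/data/file.parquet", "rb") as f:
        pf = pq.ParquetFile(f)
        for batch in pf.iter_batches(batch_size=1000):  # n records at a time
            process(batch.to_pandas())                  # stand-in for the caller's logic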
1 vote • 0 answers

Read a parquet.snappy file from AWS S3 in React Native

I am working on an app with React Native, and we are at a point where we need to read a parquet.snappy file from an S3 bucket in the app. Is there any library for that?
1 vote • 1 answer

Continuously Updating Partitioned Parquet

I have a Spark script that pulls data from a database and writes it to S3 in parquet format. The parquet data is partitioned by date. Because of the size of the table, I'd like to run the script daily and have it just rewrite the most recent few…
maxwellray • 99 • 7
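One approach is Spark's dynamic partition-overwrite mode, which rewrites only the date partitions present in the incoming frame and leaves the rest untouched. A sketch; the DataFrame name and output path are assumptions:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (recent_df                    # e.g. just the last few days pulled from the database
        .write
        .mode("overwrite")        # with dynamic mode, only matching partitions are replaced
        .partitionBy("date")
        .parquet("s3://my-bucket/table/"))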
1 vote • 2 answers

Databricks: reading data with .snappy.parquet extension

I have a table with the .snappy.parquet extension. data= 'part-001-36b4-7ea3-4165-8742-2f32d8643d-c000.snappy.parquet' I would like to read this, and I tried the following: table = spark.read.load(data, format='delta') When I try with the above…
Hiwot • 568 • 5 • 18
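A .snappy.parquet file is still plain Parquet (snappy is only the compression codec), so reading it with format='delta' fails unless the directory is actually a Delta table. A sketch of the plain-Parquet read:

    data = 'part-001-36b4-7ea3-4165-8742-2f32d8643d-c000.snappy.parquet'
    table = spark.read.parquet(data)   # the codec is detected from the file itself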
1 vote • 1 answer

Parquet file with more than one schema

I am used to parquet files with a single schema. I came across a file which seemingly has more than one schema. I used pandas to convert it to a CSV file. The result is something like this: table-1,table-2,table-3 0, {data for table-1} {dat for…
lang2 • 11,433 • 18 • 83 • 133
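A Parquet file has exactly one schema, but nested struct columns can look like several tables once flattened to CSV; a sketch for inspecting what the file really contains, with an assumed file name:

    import pyarrow.parquet as pq

    print(pq.read_schema("file.parquet"))           # full, possibly nested schema
    print(pq.ParquetFile("file.parquet").metadata)  # row groups, row counts, codecs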
1 vote • 2 answers

Selecting deep columns in pyarrow.dataset parquet

Let's say I have a deeply nested arrow table like: pyarrow.Table arr: struct not null, b: list not null> not null> child 0, arr: struct not null, b:…
mdurant • 27,272 • 5 • 45 • 74
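A hedged sketch of projecting just one nested field with a dataset expression; the arr/a names mirror the (truncated) schema above, the path is an assumption, and nested field references of this form need a reasonably recent pyarrow:

    import pyarrow.dataset as ds
    import pyarrow.compute as pc

    dataset = ds.dataset("data/", format="parquet")
    # pc.field("arr", "a") refers to the nested child `a` inside the `arr` struct.
    table = dataset.to_table(columns={"arr_a": pc.field("arr", "a")})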
1 vote • 1 answer

Can't view Staged Parquet File in S3 from Snowflake

I'm working on moving some Parquet files in S3 over to Snowflake. The Storage Integration & External Stage were created just fine, and when I run the list @mystage command I can see the file that I want to check out in S3 so I know it exists & that…
jyablonski • 711 • 1 • 7 • 17
1 vote • 0 answers

Timestamp conversion in Kinesis Firehose after record format conversion to Parquet

I have been using record format conversion in Kinesis Firehose to convert files to parquet format in S3, with the schema stored in AWS Glue. I am struggling with an issue where I am unable to configure the timestamp…