Questions tagged [parquet-dataset]

14 questions
3 votes · 2 answers

AWS Athena - UPDATE table rows using SQL

I am a newbie to the AWS ecosystem. I am creating an application which queries data using AWS Athena. Data is transformed from JSON into Parquet using AWS Glue and stored in S3. Now the use case is to update that Parquet data using SQL. Can we update…
mds404 · 371 · 4 · 9
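
For context: Athena cannot UPDATE rows in a plain Parquet table; row-level UPDATE is only supported on Apache Iceberg tables (Athena engine v3), otherwise the data has to be rewritten (e.g. via CTAS). A minimal sketch of running such an update from Python, assuming an Iceberg table and hypothetical database, table, and bucket names:

```python
# Minimal sketch (hypothetical names): run a row-level UPDATE from Python via boto3.
# This only works if the target table is an Apache Iceberg table; plain Parquet
# tables in Athena cannot be updated in place.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="UPDATE my_iceberg_table SET status = 'SHIPPED' WHERE order_id = 42",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```
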
1 vote · 0 answers

How to prevent delay in chart rendering from .parquet data fetched from Flask backend?

I am trying to create a simple GUI dashboard by fetching data from a back-end Flask server, triggering an AJAX request when I interact with the multi-checkbox drop-down menus. Essentially, I have two drop-down menus called "Select Date" and…
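
One common way to cut this kind of rendering delay is to filter server-side and return only the chart-ready slice, rather than shipping the whole Parquet file to the browser. A minimal sketch, assuming hypothetical file and column names:

```python
# Minimal sketch (hypothetical file/column names): the endpoint reads only the
# needed columns and returns just the rows matching the selected date.
from flask import Flask, jsonify, request
import pandas as pd

app = Flask(__name__)

@app.route("/chart-data")
def chart_data():
    selected_date = request.args.get("date")          # value from the drop-down
    df = pd.read_parquet("data.parquet", columns=["date", "value"])
    if selected_date:
        df = df[df["date"] == selected_date]
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(debug=True)
```
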
1 vote · 0 answers

Improving read performance of pyarrow

I have a partitioned dataset stored on an internal S3 cloud. I am reading the dataset as a pyarrow table: import pyarrow.dataset as ds; my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive"); fragments =…
Femi King · 11 · 2
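
For reads like this, pushing partition filters and column projection into to_table() usually helps, since only the matching fragments and columns are fetched from S3. A minimal sketch, assuming hypothetical paths, column names, and an S3-compatible endpoint:

```python
# Minimal sketch (hypothetical paths/columns): push filters and projection
# down so pyarrow only fetches the matching Hive partitions from S3.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(endpoint_override="https://internal-s3.example.com")

dataset = ds.dataset(
    "bucket/path/to/dataset",
    format="parquet",
    filesystem=s3,
    partitioning="hive",
)

table = dataset.to_table(
    columns=["id", "value"],                                   # read only needed columns
    filter=(ds.field("year") == 2023) & (ds.field("month") == 7),
)
```
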
1 vote · 0 answers

Pyarrow's write_to_dataset() causes "Calling the invoke API action failed with this message: Network Error" when partition_cols provided in AWS Lambda

I have an AWS Lambda Function (Python 3.8) with pyarrow 9.0.0 and s3fs bundled together in a layer. The function reads multiple JSON files one by one and converts them into a Parquet dataset with partitioning (year, month, day) in an S3 location. When…
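
For reference, a minimal sketch of the write path itself, assuming hypothetical bucket and column names (the question's Lambda layer and networking setup are not reproduced here):

```python
# Minimal sketch (hypothetical bucket/column names): convert JSON records into
# a year/month/day-partitioned Parquet dataset on S3 with pyarrow + s3fs.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

df = pd.read_json("s3://my-bucket/input/events.json", lines=True)
ts = pd.to_datetime(df["timestamp"])          # hypothetical timestamp column
df["year"], df["month"], df["day"] = ts.dt.year, ts.dt.month, ts.dt.day

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(
    table,
    root_path="my-bucket/output/events",
    partition_cols=["year", "month", "day"],
    filesystem=fs,
)
```
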
1 vote · 0 answers

ParquetDataset not taking the partitions from the filters

I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I am doing it using pyarrow. My S3 dataset is partitioned by client, year, month, and day using Hive partitioning (client=, year=, ...). I am giving the…
Mhmd Dar · 13 · 3
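
For reference, partition pruning via filters generally looks like the sketch below; on older pyarrow versions, non-partition filters additionally require the non-legacy dataset implementation (use_legacy_dataset=False). Paths and partition values are hypothetical:

```python
# Minimal sketch (hypothetical path and partition values): filters on Hive
# partition columns prune which client=/year=/month=/day= directories are read.
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()

dataset = pq.ParquetDataset(
    "bucket/path/to/dataset",
    filesystem=s3,
    filters=[
        ("client", "=", "acme"),
        ("year", "=", 2023),
        ("month", "=", 7),
        ("day", "=", 15),
    ],
)
table = dataset.read()
```
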
0 votes · 0 answers

Dask dataframe creates folders instead of files when saving processed files to parquet

I have some very large Parquet files on which I want to do some processing, merging, and cleaning, and then save the results into another folder. I am using a Dask dataframe since it's the only way I can read those files without getting out of…
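
For context: dd.to_parquet() always writes a directory of part files, one per partition, so the folders are expected. A minimal sketch (hypothetical paths) of both the normal folder output and a per-partition workaround that produces standalone files:

```python
# Minimal sketch (hypothetical paths): dd.to_parquet() always creates a
# directory of part-*.parquet files. To end up with plain files instead,
# one option is to write each partition yourself with pandas.
import dask.dataframe as dd

ddf = dd.read_parquet("input/*.parquet")
ddf = ddf.dropna()                      # stand-in for the cleaning/merging steps

# Option 1: accept the folder layout (this is Dask's normal output).
ddf.to_parquet("output/cleaned/", write_index=False)

# Option 2: one standalone file per partition, written via pandas.
for i, part in enumerate(ddf.to_delayed()):
    part.compute().to_parquet(f"output/cleaned_part_{i}.parquet", index=False)
```
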
0 votes · 1 answer

partitioning a Parquet file in Data Factory

I am doing my project in Data Factory and I need to save information recurrently into the same Parquet file. Every so often there is an update to the information, and I would like it to be added to the Parquet as a partition of the…
0 votes · 0 answers

Reading Parquet v2 file with Javascript

I've searched through the node package manager (NPM) and I can't seem to find a working Parquet library that also supports version 2. parquets was the only working parser I could find, and I got this…
Hackermon · 78 · 1 · 7
0 votes · 0 answers

Parquet - Specifying file path when using external key material

I have a use case where I have to encrypt my Parquet files. I implemented the KMSClient abstract class provided by Parquet's CryptoFactory and have been able to encrypt and decrypt the Parquet files and the DEK. While the above is working as expected,…
Alex Bloomberg · 855 · 1 · 7 · 14
0 votes · 0 answers

Join 2 large tables (50 GB and 1 billion records)

I have 2 super-large tables which I am loading as dataframes in Parquet format with one join key. Now the issues I need help with: I need to tune the job, as I am getting OOM errors due to Java heap space. I have to apply a left join. There will not be any…
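
The question doesn't show code, but a left join at this scale typically runs as a sort-merge join; the usual levers are projecting only the needed columns and raising the shuffle-partition count so each task fits in the heap. A minimal PySpark sketch with hypothetical paths and column names:

```python
# Minimal sketch (hypothetical paths/column names): left join of two large
# Parquet tables with trimmed columns and a higher shuffle-partition count.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("big-left-join")
    .config("spark.sql.shuffle.partitions", "2000")   # more, smaller shuffle tasks
    .getOrCreate()
)

left = spark.read.parquet("s3://bucket/table_a").select("join_key", "a_col")
right = spark.read.parquet("s3://bucket/table_b").select("join_key", "b_col")

joined = left.join(right, on="join_key", how="left")
joined.write.mode("overwrite").parquet("s3://bucket/joined")
```
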
0 votes · 1 answer

Reading Parquet file from Spark

I use the following method to read a Parquet file in Spark: scala> val df = spark.read.parquet("hdfs:/ORDER_INFO"); scala> df.show(). When I show the content of the DataFrame, it displays encoded values like [49 4E 53 5F 32 33] and [49 4E 53 5F 32 30]. In…
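
Values like [49 4E 53 5F 32 33] are the raw bytes of the string "INS_23", i.e. the column was written as binary; casting it to string makes df.show() print readable text. A minimal sketch in PySpark (the question uses the Scala shell; the column name here is hypothetical):

```python
# Minimal sketch (PySpark; hypothetical column name): the column was written
# as binary, so df.show() prints hex bytes. Casting to string fixes the display.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("hdfs:/ORDER_INFO")
df = df.withColumn("order_id", col("order_id").cast("string"))
df.show()
```
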
0 votes · 0 answers

How to import a parquet.gzip.cpgz file?

I am trying to open the following file in R: deputies.parquet.gzip.cpgz. Does anyone know how to do this? I have imported Parquet files before using the arrow package, but I'm not sure how to import this type.
w5698 · 159 · 7
0 votes · 1 answer

Load Parquet Files from ADLS Gen2 using ADF

I would like to set up an ADF pipeline to load all the Parquet files hosted for 2+ years on ADLS Gen2 with a hierarchy of Year -> Month -> Day -> Hour -> Min. Over that period, we did have some file structure changes with a…
0 votes · 0 answers

Parquet schema / data type for entire null object DataFrame columns

I'm writing a DataFrame to binary Parquet format with one or more entirely-null object columns. If I then load the Parquet dataset with use_legacy_dataset=False: parquet_dataset = pq.ParquetDataset(root_path, use_legacy_dataset=False,…
mishbah · 5,487 · 5 · 25 · 35
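
For context: a pandas object column that is entirely None is inferred by pyarrow as the null type, which is what usually surprises people on reload. A minimal sketch (hypothetical column names) of forcing a concrete type by passing an explicit schema when writing:

```python
# Minimal sketch (hypothetical column names): an all-None object column is
# inferred as Arrow's "null" type; supplying an explicit schema when writing
# gives it a concrete type (string here) that survives the round trip.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "comment": [None, None, None]})

schema = pa.schema([
    ("id", pa.int64()),
    ("comment", pa.string()),     # force a real type instead of pa.null()
])

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_to_dataset(table, root_path="dataset_root")

loaded = pq.ParquetDataset("dataset_root").read()
print(loaded.schema)
```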