Questions tagged [aws-data-wrangler]

AWS Data Wrangler provides abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases. It integrates with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server, and S3 (Parquet, CSV, JSON, and Excel).

Project: awswrangler · PyPI

69 questions
1
vote
1 answer

How to add an ODBC driver to AWS Glue Python Shell

I want to use pyodbc in an AWS Glue Python Shell job, but it requires an ODBC driver. Currently I get an error like "Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)". Is there any way to install an ODBC driver into Glue?
yunus kula
  • 859
  • 3
  • 10
  • 31
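
The usual workaround for this class of error is to bundle the driver's shared library with the job and point unixODBC at it before pyodbc is imported. A minimal sketch, assuming the Microsoft driver library has been unpacked to a known directory (the directory, the library filename, and the connection details are all hypothetical):

```python
import os

# Hypothetical location where the bundled driver files were unpacked.
driver_dir = "/tmp/odbc"
os.makedirs(driver_dir, exist_ok=True)

# Register the driver with unixODBC by writing an odbcinst.ini.
with open(os.path.join(driver_dir, "odbcinst.ini"), "w") as f:
    f.write(
        "[ODBC Driver 17 for SQL Server]\n"
        f"Driver={driver_dir}/lib/libmsodbcsql-17.10.so.1.1\n"  # hypothetical filename
    )

# Must be set before pyodbc is imported so it reads the ini from this directory.
os.environ["ODBCSYSINI"] = driver_dir

import pyodbc  # noqa: E402

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server.example.com;DATABASE=mydb;UID=user;PWD=secret"  # hypothetical
)
```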
1
vote
0 answers

awswrangler.s3.to_parquet arguments question

If I have the following code: import awswrangler as wr #df = some dataframe with year, date and other columns wr.s3.to_parquet( df=df, path=f's3://some/path/', index=False, dataset=True, mode="append", partition_cols=['year',…
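
For reference, a complete, self-contained version of such a call might look like the sketch below (the bucket, prefix, and column names are made up):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {"year": [2022, 2022], "date": ["2022-08-01", "2022-08-02"], "value": [1, 2]}
)

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # hypothetical bucket/prefix
    index=False,
    dataset=True,             # dataset=True is required for mode= and partition_cols=
    mode="append",
    partition_cols=["year"],  # creates year=2022/ style folders under the prefix
)
```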
1
vote
1 answer

What is AWS S3 dataset?

Looking at the documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter. From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks…
d.b
  • 32,245
  • 6
  • 36
  • 77
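
A small sketch of that behaviour with a hypothetical bucket: with dataset=True the path is treated as a multi-file dataset, so repeated writes can append and the whole prefix can be read back as one table.

```python
import awswrangler as wr
import pandas as pd

path = "s3://my-bucket/my-dataset/"  # hypothetical

# First write replaces whatever is under the prefix...
wr.s3.to_csv(pd.DataFrame({"a": [1]}), path, dataset=True, mode="overwrite", index=False)
# ...and later writes can append new files to the same dataset.
wr.s3.to_csv(pd.DataFrame({"a": [2]}), path, dataset=True, mode="append", index=False)

# Reading with dataset=True collects every file under the prefix.
df = wr.s3.read_csv(path, dataset=True)  # both rows come back
```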
1
vote
0 answers

Create 'granular' date partitions (year, month, date) in S3 parquet folders from a single date column in AWS Wrangler

I am using Data Wrangler to upload data from a dataframe into S3 as Parquet files, and am trying to get it in a Hive-like folder structure of:
prefix
  year=2022
    month=08
      day=01
      day=02
      day=03
In the following code…
Ben L
  • 83
  • 8
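
One way to get that layout (a sketch; the column, bucket, and prefix names are hypothetical) is to derive zero-padded year/month/day columns from the single date column and pass them as partition_cols:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2022-08-01", "2022-08-02"]), "value": [1, 2]}
)

# Derive the partition columns from the single date column;
# strftime keeps the zero-padding (month=08, day=01, ...).
df["year"] = df["ts"].dt.strftime("%Y")
df["month"] = df["ts"].dt.strftime("%m")
df["day"] = df["ts"].dt.strftime("%d")

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/prefix/",  # hypothetical
    dataset=True,
    mode="append",
    partition_cols=["year", "month", "day"],  # year=2022/month=08/day=01/...
)
```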
1
vote
0 answers

awswrangler: KeyError while applying partition_filter, even though the key exists when importing the dataframe

I am trying to load a pd.DataFrame, reading it from a parquet file in AWS. I am applying partition_filter to get only certain data from the df that corresponds to the conditions I want. In this particular case, the df column df['source'] must be…
The Dan
  • 1,408
  • 6
  • 16
  • 41
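
Worth noting: partition_filter is evaluated against the partition key/value pairs only (passed as strings), so a KeyError like this typically means the filtered column is a regular data column rather than a partition column. A sketch of a working call, with a hypothetical path:

```python
import awswrangler as wr

# The lambda receives a dict of {partition_column: value-as-string}.
# Filtering on 'source' only works if the dataset was written with
# partition_cols=["source"]; otherwise the key is absent -> KeyError.
df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical
    dataset=True,                       # required for partition_filter
    partition_filter=lambda x: x["source"] == "web",
)
```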
1
vote
1 answer

AWS Data Wrangler s3.to_parquet replicate current S3 path structure

When using wr.s3.to_parquet I can construct a path with a formatted string literal (f-string) and reuse existing folders that follow the pattern. def SaveInS3_test(Ticker, Granularity, Bucket, df, keyPrefix=""): year, month, day =…
nipy
  • 5,138
  • 5
  • 31
  • 72
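
An alternative sketch (parameter names borrowed from the question, everything else hypothetical): instead of building the path by hand with an f-string, add the path components as columns and let partition_cols recreate the folder structure.

```python
import awswrangler as wr
import pandas as pd

def save_in_s3(ticker: str, granularity: str, bucket: str,
               df: pd.DataFrame, key_prefix: str = "") -> None:
    # Hypothetical rewrite: the partition columns generate the folders,
    # e.g. s3://bucket/prefix/Ticker=.../Granularity=.../
    df = df.assign(Ticker=ticker, Granularity=granularity)
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/{key_prefix}",
        dataset=True,
        mode="append",
        partition_cols=["Ticker", "Granularity"],
    )
```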
1
vote
0 answers

ValidationException Importing from Redshift into Data Wrangler

I'm trying to build a model workflow in AWS SageMaker using Data Wrangler for preprocessing. I'm loading data from various tables in a Redshift instance, before mutating and joining them as required to build the model input data. I'm a contractor…
1
vote
0 answers

AWS Data Wrangler chunking: "Length mismatch: Expected axis has 28680 elements, new values have 100000 elements"

While loading the parquet file, I get the below error. My parquet file contains 5,128,680 rows, but only 5,100,000 are loaded; the remaining 28,680 records are not. The code is as below: Error : Response { "errorMessage": "Length mismatch:…
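
This error pattern usually comes from code that assumes every chunk holds exactly the requested number of rows; the final chunk is shorter (here 28,680 of 5,128,680). A hedged sketch of chunked reading that sizes all per-chunk work to len(chunk), with a hypothetical path:

```python
import awswrangler as wr

# chunked=<int> yields DataFrames of *up to* that many rows; the last chunk
# is usually smaller, so never reassign columns/values sized for 100_000.
for chunk in wr.s3.read_parquet(
    path="s3://my-bucket/big-file.parquet",  # hypothetical
    chunked=100_000,
):
    print(len(chunk))  # the final iteration would print 28680 here
```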
1
vote
0 answers

Read data from AWS S3 using PySpark and Python (read all columns: partitioned column also)

I have saved the Spark dataframe to AWS S3 in Parquet format, partitioned by the column "channel_name". The code below is how I saved it to S3: df.write.option("header",True) \ .partitionBy("channel_name") \ .mode("overwrite") \ …
SSS
  • 73
  • 11
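
For reference, reading the dataset back from its root path restores the partition column from the channel_name=... folder names; a minimal sketch with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the *root* of the partitioned dataset (not a channel_name=... subfolder)
# lets Spark parse the folder names back into a regular 'channel_name' column.
df = spark.read.parquet("s3://my-bucket/output/")  # hypothetical root path
df.select("channel_name").distinct().show()
```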
1
vote
2 answers

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Wrangler to do this. My code is as follows: …
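
A minimal sketch of that merge, assuming the combined frame does fit in memory at merge time (both paths are hypothetical): reading the prefix concatenates every small file, and writing with the default dataset=False to an explicit object key produces a single file.

```python
import awswrangler as wr

# Read every small parquet file under the prefix into one DataFrame.
df = wr.s3.read_parquet(path="s3://my-bucket/small-files/")  # hypothetical

# With an explicit object key (and the default dataset=False),
# to_parquet writes exactly one file.
wr.s3.to_parquet(df=df, path="s3://my-bucket/merged/all.parquet", index=False)
```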
1
vote
1 answer

AWS Glue - table version increases on data load even with no schema changes

I have a Lambda job which infrequently dumps a parquet file into an S3 bucket/Glue table using AWS Wrangler. This Glue table appears to increase its table version number every time there is new data, even though the schema is unchanged. I do…
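
One knob to check (an assumption, not a confirmed fix): wr.s3.to_parquet exposes a catalog_versioning flag that controls whether Glue archives a new table version when the catalog entry is updated. A hypothetical sketch:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1], "value": ["a"]})

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-table/",  # hypothetical
    dataset=True,
    mode="append",
    database="my_db",                 # hypothetical Glue database
    table="my_table",
    catalog_versioning=False,         # assumption: skips archiving a version per write
)
```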
1
vote
0 answers

NumPy compatibility issue on AWS

I need to use AWS Data Wrangler, NumPy and SciPy in one AWS Lambda. To make this possible I use two layers: the layer provided by AWS, AWSLambda-Python38-SciPy1x - AWS Lambda SciPy Layer for Python38 (scipy-1.5.1, numpy-1.19.0), and a custom layer created from…
alex
  • 10,900
  • 15
  • 70
  • 100
1
vote
1 answer

Connect to AWS Redshift using awswrangler

import awswrangler as wr con = wr.redshift.connect("MY_GLUE_CONNECTION") What would be the value of "MY_GLUE_CONNECTION"?
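
For context, the argument is the name of a connection defined in the AWS Glue Data Catalog (Glue console → Connections) that stores the Redshift endpoint, database, and credentials; a sketch with a hypothetical connection name:

```python
import awswrangler as wr

# "my-redshift-connection" is the hypothetical name of a Glue Catalog
# connection holding the Redshift host, port, database, and credentials.
con = wr.redshift.connect("my-redshift-connection")
df = wr.redshift.read_sql_query("SELECT 1 AS x", con=con)
con.close()
```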
1
vote
1 answer

AWS Lambda - AwsWrangler - Pandas/Pytz - Unable to import required dependencies:pytz:

To get past NumPy errors, I downloaded the zip awswrangler-layer-1.9.6-py3.8 from https://github.com/awslabs/aws-data-wrangler/releases. I want to use Pandas to convert JSON to CSV, and it's working fine in my PyCharm development environment on…
NealWalters
  • 17,197
  • 42
  • 141
  • 251
0
votes
0 answers

Is there a way to setup retries / timeouts with `awswrangler.athena.read_sql_query`?

Often I am running many long-running Athena queries with awswrangler, and from time to time I receive: ReadTimeoutError: Read timeout on endpoint URL: "None". Sometimes I receive some other error, e.g., an incorrect number of bytes received. Is there a…
Samuel Hapak
  • 6,950
  • 3
  • 35
  • 58
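
One approach worth trying (a sketch, assuming awswrangler 2.x's global configuration object): attach a botocore Config with longer timeouts and more retries, which wrangler then applies to the boto3 clients it creates internally.

```python
import botocore.config
import awswrangler as wr

# Assumption: wr.config.botocore_config is honoured by the clients
# awswrangler builds internally (documented as a global configuration in 2.x).
wr.config.botocore_config = botocore.config.Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    connect_timeout=30,
    read_timeout=900,  # seconds; generous for long-running Athena result fetches
)

df = wr.athena.read_sql_query("SELECT 1", database="my_db")  # hypothetical database
```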