Questions tagged [aws-data-wrangler]

AWS Data Wrangler provides abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases. It integrates with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server, and S3 (Parquet, CSV, JSON, and Excel).

Project: awswrangler · PyPI

69 questions
1
vote
1 answer

How to add an ODBC driver to AWS Glue Python Shell

I want to use pyodbc in an AWS Glue Python Shell job, but it requires an ODBC driver. Currently I get an error like "Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)". Is there any way to install an ODBC driver into Glue?
yunus kula
  • 859
  • 3
  • 10
  • 31
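
The usual workaround for this class of error is to bundle the driver's shared library with the job and point unixODBC at it before pyodbc is imported. A minimal sketch, assuming the Microsoft driver library has been unpacked to a known directory (the directory, the library filename, and the connection details are all hypothetical):

```python
import os

# Hypothetical location where the bundled driver files were unpacked.
driver_dir = "/tmp/odbc"
os.makedirs(driver_dir, exist_ok=True)

# Register the driver with unixODBC by writing an odbcinst.ini.
with open(os.path.join(driver_dir, "odbcinst.ini"), "w") as f:
    f.write(
        "[ODBC Driver 17 for SQL Server]\n"
        f"Driver={driver_dir}/lib/libmsodbcsql-17.10.so.1.1\n"  # hypothetical filename
    )

# Must be set before pyodbc is imported so it reads the ini from this directory.
os.environ["ODBCSYSINI"] = driver_dir

import pyodbc  # noqa: E402

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-server.example.com;DATABASE=mydb;UID=user;PWD=secret"  # hypothetical
)
```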
1
vote
0 answers

awswrangler.s3.to_parquet arguments question

If I have the following code: import awswrangler as wr #df = some dataframe with year, date and other columns wr.s3.to_parquet( df=df, path=f's3://some/path/', index=False, dataset=True, mode="append", partition_cols=['year',…
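
For reference, a complete, self-contained version of such a call might look like the sketch below (the bucket, prefix, and column names are made up):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {"year": [2022, 2022], "date": ["2022-08-01", "2022-08-02"], "value": [1, 2]}
)

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # hypothetical bucket/prefix
    index=False,
    dataset=True,             # dataset=True is required for mode= and partition_cols=
    mode="append",
    partition_cols=["year"],  # creates year=2022/ style folders under the prefix
)
```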
1
vote
1 answer

What is AWS S3 dataset?

Looking at the documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter. From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks…
d.b
  • 32,245
  • 6
  • 36
  • 77
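
A small sketch of that behaviour with a hypothetical bucket: with dataset=True the path is treated as a multi-file dataset, so repeated writes can append and the whole prefix can be read back as one table.

```python
import awswrangler as wr
import pandas as pd

path = "s3://my-bucket/my-dataset/"  # hypothetical

# First write replaces whatever is under the prefix...
wr.s3.to_csv(pd.DataFrame({"a": [1]}), path, dataset=True, mode="overwrite", index=False)
# ...and later writes can append new files to the same dataset.
wr.s3.to_csv(pd.DataFrame({"a": [2]}), path, dataset=True, mode="append", index=False)

# Reading with dataset=True collects every file under the prefix.
df = wr.s3.read_csv(path, dataset=True)  # both rows come back
```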
1
vote
0 answers

Create 'granular' date partitions (year, month, date) in S3 parquet folders from a single date column in AWS Wrangler

I am using Data Wrangler to upload data from a dataframe into S3 as Parquet files, and am trying to get it in a Hive-like folder structure of:
prefix
  year=2022
    month=08
      day=01
      day=02
      day=03
In the following code…
Ben L
  • 83
  • 8
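
One way to get that layout (a sketch; the column, bucket, and prefix names are hypothetical) is to derive zero-padded year/month/day columns from the single date column and pass them as partition_cols:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {"ts": pd.to_datetime(["2022-08-01", "2022-08-02"]), "value": [1, 2]}
)

# Derive the partition columns from the single date column;
# strftime keeps the zero-padding (month=08, day=01, ...).
df["year"] = df["ts"].dt.strftime("%Y")
df["month"] = df["ts"].dt.strftime("%m")
df["day"] = df["ts"].dt.strftime("%d")

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/prefix/",  # hypothetical
    dataset=True,
    mode="append",
    partition_cols=["year", "month", "day"],  # year=2022/month=08/day=01/...
)
```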
1
vote
0 answers

awswrangler: KeyError while applying partition_filter, even though the key exists when importing the dataframe

I am trying to load a pd.DataFrame, reading it from a parquet file in AWS. I am applying partition_filter to get only certain data from the df that corresponds to the conditions I want. In this particular case, the df column df['source'] must be…
The Dan
  • 1,408
  • 6
  • 16
  • 41
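
Worth noting: partition_filter is evaluated against the partition key/value pairs only (passed as strings), so a KeyError like this typically means the filtered column is a regular data column rather than a partition column. A sketch of a working call, with a hypothetical path:

```python
import awswrangler as wr

# The lambda receives a dict of {partition_column: value-as-string}.
# Filtering on 'source' only works if the dataset was written with
# partition_cols=["source"]; otherwise the key is absent -> KeyError.
df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical
    dataset=True,                       # required for partition_filter
    partition_filter=lambda x: x["source"] == "web",
)
```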
1
vote
1 answer

AWS Data Wrangler s3.to_parquet replicate current S3 path structure

When using wr.s3.to_parquet I can construct a path with a formatted string literal (f-string) and reuse existing folders that follow the pattern. def SaveInS3_test(Ticker, Granularity, Bucket, df, keyPrefix=""): year, month, day =…
nipy
  • 5,138
  • 5
  • 31
  • 72
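
An alternative sketch (parameter names borrowed from the question, everything else hypothetical): instead of building the path by hand with an f-string, add the path components as columns and let partition_cols recreate the folder structure.

```python
import awswrangler as wr
import pandas as pd

def save_in_s3(ticker: str, granularity: str, bucket: str,
               df: pd.DataFrame, key_prefix: str = "") -> None:
    # Hypothetical rewrite: the partition columns generate the folders,
    # e.g. s3://bucket/prefix/Ticker=.../Granularity=.../
    df = df.assign(Ticker=ticker, Granularity=granularity)
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/{key_prefix}",
        dataset=True,
        mode="append",
        partition_cols=["Ticker", "Granularity"],
    )
```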
1
vote
0 answers

ValidationException Importing from Redshift into Data Wrangler

I'm trying to build a model workflow in AWS SageMaker using Data Wrangler for preprocessing. I'm loading data from various tables in a Redshift instance, before mutating and joining them as required to build the model input data. I'm a contractor…
1
vote
0 answers

AWS Data Wrangler chunking: "Length mismatch: Expected axis has 28680 elements, new values have 100000 elements"

While loading the parquet file, I get the below error. My parquet file contains 5,128,680 rows, but only 5,100,000 are loaded; the remaining 28,680 records are not. The code is as below: Error : Response { "errorMessage": "Length mismatch:…
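
This error pattern usually comes from code that assumes every chunk holds exactly the requested number of rows; the final chunk is shorter (here 28,680 of 5,128,680). A hedged sketch of chunked reading that sizes all per-chunk work to len(chunk), with a hypothetical path:

```python
import awswrangler as wr

# chunked=<int> yields DataFrames of *up to* that many rows; the last chunk
# is usually smaller, so never reassign columns/values sized for 100_000.
for chunk in wr.s3.read_parquet(
    path="s3://my-bucket/big-file.parquet",  # hypothetical
    chunked=100_000,
):
    print(len(chunk))  # the final iteration would print 28680 here
```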
1
vote
0 answers

Read data from AWS S3 using PySpark and Python (read all columns: partitioned column also)

I have saved the Spark dataframe to AWS S3 in Parquet format, partitioned by the column "channel_name". The code below is how I saved it to S3: df.write.option("header",True) \ .partitionBy("channel_name") \ .mode("overwrite") \ …
SSS
  • 73
  • 11
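
For reference, reading the dataset back from its root path restores the partition column from the channel_name=... folder names; a minimal sketch with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the *root* of the partitioned dataset (not a channel_name=... subfolder)
# lets Spark parse the folder names back into a regular 'channel_name' column.
df = spark.read.parquet("s3://my-bucket/output/")  # hypothetical root path
df.select("channel_name").distinct().show()
```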
1
vote
2 answers

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Wrangler to do this. My code is as follows: …
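
A minimal sketch of that merge, assuming the combined frame does fit in memory at merge time (both paths are hypothetical): reading the prefix concatenates every small file, and writing with the default dataset=False to an explicit object key produces a single file.

```python
import awswrangler as wr

# Read every small parquet file under the prefix into one DataFrame.
df = wr.s3.read_parquet(path="s3://my-bucket/small-files/")  # hypothetical

# With an explicit object key (and the default dataset=False),
# to_parquet writes exactly one file.
wr.s3.to_parquet(df=df, path="s3://my-bucket/merged/all.parquet", index=False)
```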
1
vote
1 answer

AWS Glue - table version increases on data load even with no schema changes

I have a Lambda job which infrequently dumps a parquet file into an S3 bucket/Glue table using AWS Wrangler. This Glue table appears to increase its table version number every time there is new data, even though the schema is unchanged. I do…
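
One knob to check (an assumption, not a confirmed fix): wr.s3.to_parquet exposes a catalog_versioning flag that controls whether Glue archives a new table version when the catalog entry is updated. A hypothetical sketch:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1], "value": ["a"]})

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-table/",  # hypothetical
    dataset=True,
    mode="append",
    database="my_db",                 # hypothetical Glue database
    table="my_table",
    catalog_versioning=False,         # assumption: skips archiving a version per write
)
```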
1
vote
0 answers

NumPy compatibility issue on AWS

I need to use AWS Data Wrangler, NumPy and SciPy in one AWS Lambda. To make this possible I use two layers: the layer provided by AWS, AWSLambda-Python38-SciPy1x - AWS Lambda SciPy Layer for Python38 (scipy-1.5.1, numpy-1.19.0), and a custom layer created from…
alex
  • 10,900
  • 15
  • 70
  • 100
1
vote
1 answer

Connect to AWS Redshift using awswrangler

import awswrangler as wr con = wr.redshift.connect("MY_GLUE_CONNECTION") What would be the value of "MY_GLUE_CONNECTION"?
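
For context, the argument is the name of a connection defined in the AWS Glue Data Catalog (Glue console → Connections) that stores the Redshift endpoint, database, and credentials; a sketch with a hypothetical connection name:

```python
import awswrangler as wr

# "my-redshift-connection" is the hypothetical name of a Glue Catalog
# connection holding the Redshift host, port, database, and credentials.
con = wr.redshift.connect("my-redshift-connection")
df = wr.redshift.read_sql_query("SELECT 1 AS x", con=con)
con.close()
```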
1
vote
1 answer

AWS Lambda - AwsWrangler - Pandas/Pytz - Unable to import required dependencies:pytz:

To get past NumPy errors, I downloaded the zip awswrangler-layer-1.9.6-py3.8 from https://github.com/awslabs/aws-data-wrangler/releases. I want to use Pandas to convert JSON to CSV, and it's working fine in my PyCharm development environment on…
NealWalters
  • 17,197
  • 42
  • 141
  • 251
0
votes
0 answers

Is there a way to setup retries / timeouts with `awswrangler.athena.read_sql_query`?

Often I am running many long-running Athena queries with awswrangler, and from time to time I receive: ReadTimeoutError: Read timeout on endpoint URL: "None". Sometimes I receive some other error, e.g., an incorrect number of bytes received. Is there a…
Samuel Hapak
  • 6,950
  • 3
  • 35
  • 58
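
One approach worth trying (a sketch, assuming awswrangler 2.x's global configuration object): attach a botocore Config with longer timeouts and more retries, which wrangler then applies to the boto3 clients it creates internally.

```python
import botocore.config
import awswrangler as wr

# Assumption: wr.config.botocore_config is honoured by the clients
# awswrangler builds internally (documented as a global configuration in 2.x).
wr.config.botocore_config = botocore.config.Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    connect_timeout=30,
    read_timeout=900,  # seconds; generous for long-running Athena result fetches
)

df = wr.athena.read_sql_query("SELECT 1", database="my_db")  # hypothetical database
```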