Questions tagged [aws-data-wrangler]

AWS Data Wrangler (now the AWS SDK for pandas) offers abstracted functions for common ETL tasks such as loading and unloading data between data lakes, data warehouses, and databases. It integrates with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatch Logs, DynamoDB, EMR, Secrets Manager, PostgreSQL, MySQL, SQL Server, and S3 (Parquet, CSV, JSON, and Excel).

Project: awswrangler · PyPI

69 questions
0 votes · 0 answers

Is it possible to omit header rows when exporting a SageMaker Data Wrangler flow to S3 (via a Jupyter Notebook)?

I am exporting a Data Wrangler flow to S3 via a Jupyter Notebook using SageMaker Studio. Each of the resulting CSV files (each containing a part of the transformed dataset) includes a header row with the column names. However, when using a CSV file as…
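One possible workaround, if the exported part files can be rewritten after the flow runs, is sketched below. The paths are hypothetical, and `header=False` is a plain pandas kwarg that awswrangler forwards to `DataFrame.to_csv()`:

```python
import awswrangler as wr

# Hypothetical locations -- replace with the flow's actual export prefix.
src = "s3://my-bucket/datawrangler-export/"
dst = "s3://my-bucket/datawrangler-export-noheader/part-0.csv"

# Read every exported part file under the prefix back into one dataframe...
df = wr.s3.read_csv(path=src)

# ...and rewrite it without the header row.
wr.s3.to_csv(df, path=dst, index=False, header=False)
```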
0 votes · 0 answers

Which pandas module can be used to read parquet files in parallel?

I am using Data Wrangler to read parquet datasets. The partition has 300 files, and each file is around 256 MB. I am using a SageMaker ml.r5.24xlarge instance, which has 96 cores. The processing job does three tasks: read the parquet file, execute the model, write the…
user3858193 • 1,320 • 5 • 18 • 50
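awswrangler parallelizes S3 reads itself when asked. A minimal sketch, assuming a hypothetical dataset prefix:

```python
import awswrangler as wr

# use_threads=True fans the part files out across all available cores;
# newer versions also accept an int to cap the thread count.
df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical prefix
    dataset=True,
    use_threads=True,
)
```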
0 votes · 0 answers

How to connect to Amazon Athena using Simba ODBC and Python

I have Python code that reads data from Athena, and it works fine in the AWS portal, but it does not work from my local computer because of security policies (using a secret key is forbidden for us). This code uses awswrangler and boto3 to read the data.…
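A sketch of one approach using pyodbc with profile-based authentication, so no secret key is embedded. The connection-string keys below (AwsRegion, S3OutputLocation, AuthenticationType, AWSProfile) follow the Simba Athena ODBC driver's documented options, but they vary by driver version, so treat them as assumptions to verify:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver=Simba Athena ODBC Driver;"
    "AwsRegion=us-east-1;"
    "S3OutputLocation=s3://my-query-results/;"  # hypothetical bucket
    "AuthenticationType=IAM Profile;"           # no embedded keys
    "AWSProfile=my-sso-profile;",               # hypothetical profile name
    autocommit=True,
)
for row in conn.cursor().execute("SELECT 1"):
    print(row)
```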
0 votes · 0 answers

AWS Wrangler WaiterError: Waiter BucketExists failed: Max attempts exceeded. Previously accepted state: Matched expected HTTP status code: 404

I've been trying to query information from Athena using the following bit of code: import boto3 import awswrangler as wr sessAWS_Test = boto3.session.Session( aws_access_key_id = 'id', aws_secret_access_key = 'key', region_name = 'region…
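This waiter typically fires when awswrangler has to create (and then poll for) its default results bucket. A minimal sketch of one way around it, assuming a results bucket that already exists in the session's region:

```python
import boto3
import awswrangler as wr

sess = boto3.Session(region_name="us-east-1")  # match the bucket's region

# An explicit, pre-existing s3_output means awswrangler never has to
# create a bucket, so the BucketExists waiter is never run.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table LIMIT 10",         # hypothetical query
    database="my_database",                        # hypothetical database
    s3_output="s3://my-existing-results-bucket/",  # must already exist
    boto3_session=sess,
)
```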
0 votes · 1 answer

AWS Wrangler (pandas layer): problem with path to S3 bucket

Here is my Python code in my Lambda layer. Shout out to John R for some of this paginator code. From API Gateway, I pass in a path param (bucket) and query string params (fmt & date), such…
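A minimal sketch of the pattern being described, with hypothetical parameter names, showing how to turn the API Gateway params into fully qualified s3:// paths:

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Path and query-string parameters as API Gateway delivers them.
    bucket = event["pathParameters"]["bucket"]
    fmt = event["queryStringParameters"]["fmt"]
    date = event["queryStringParameters"]["date"]

    # Paginate so prefixes with more than 1,000 objects are not truncated.
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{date}/"):
        keys += [o["Key"] for o in page.get("Contents", [])
                 if o["Key"].endswith(f".{fmt}")]

    # awswrangler and pandas expect full s3:// URLs, not bare keys.
    return [f"s3://{bucket}/{k}" for k in keys]
```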
0 votes · 1 answer

awswrangler query Athena: AttributeError: Can only use .dt accessor with datetimelike values

I have one table in Athena in which all columns have proper datatypes (date, bigint, int, decimal(28,2), string, etc.). I need to query the data via the AWS Wrangler API athena.read_sql_query. I write: athena.read_sql_query(sql=test_query,…
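The error means the column arrived as plain objects rather than datetimes (NULLs or the query approach can cause this). A minimal sketch of the usual fix, with hypothetical names, is to coerce before using `.dt`:

```python
import pandas as pd
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table",  # hypothetical query
    database="my_database",
)

# Coerce explicitly; unparseable values become NaT instead of raising.
df["my_date_col"] = pd.to_datetime(df["my_date_col"], errors="coerce")
print(df["my_date_col"].dt.year.head())
```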
0 votes · 0 answers

How to remove extra space in JSON output from Lambda

I have a Lambda reading CSVs in an S3 bucket. API Gateway calls the Lambda. The data in the CSV looks like this: Ticker Exchange Date Open High Low Close Volume 6A BATS 12/2/2021 0.9 0.95 0.83 0.95 …
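A minimal sketch of one way to produce compact JSON from such a CSV (the path is hypothetical): strip padded string cells, then serialize without the spaces `json.dumps` inserts by default:

```python
import json
import awswrangler as wr

df = wr.s3.read_csv("s3://my-bucket/quotes.csv")  # hypothetical path

# Trim stray padding in string cells, a common source of extra
# whitespace when a fixed-width-looking file is parsed as CSV.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip()

# separators=(",", ":") removes the default ", " and ": " spacing.
body = json.dumps(df.to_dict(orient="records"), separators=(",", ":"))
```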
0 votes · 0 answers

AWS Wrangler wr.athena.read_sql_query, s3_additional_kwargs not tagging s3 objects

I am trying to add cost-allocation tags to the S3 resources created by Athena queries, so that I can analyze the S3 costs of different applications related to Athena usage. To achieve this, I am making use of the parameter s3_additional_kwargs when…
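One likely explanation is that the result objects are written by Athena itself, not by awswrangler, so `s3_additional_kwargs` never touches them. A sketch of a workaround under that assumption: run the query, find its output object, and tag it afterwards:

```python
import boto3
import awswrangler as wr

# wait=True returns the query-execution metadata, which includes the
# S3 location of the result object.
resp = wr.athena.start_query_execution(
    sql="SELECT * FROM my_table",  # hypothetical query
    database="my_database",
    wait=True,
)
output = resp["ResultConfiguration"]["OutputLocation"]  # s3://bucket/key.csv
bucket, key = output.replace("s3://", "").split("/", 1)

# Tag the result object directly, since Athena wrote it.
boto3.client("s3").put_object_tagging(
    Bucket=bucket,
    Key=key,
    Tagging={"TagSet": [{"Key": "application", "Value": "my-app"}]},
)
```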
0 votes · 0 answers

How to test awswrangler with local data

I am working with awswrangler to execute Athena queries and transform the results with pandas. I want to test my code locally without any actual AWS instance. Is there a way to mock AWS services, or another way to work with awswrangler locally?
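moto can stand in for S3 locally; awswrangler's S3 functions go through boto3, so they pick the mock up. A minimal sketch (on moto < 5 the decorator is `mock_s3` instead of `mock_aws`):

```python
import os

import boto3
import pandas as pd
import awswrangler as wr
from moto import mock_aws  # moto >= 5; older versions: from moto import mock_s3

# Fake credentials so nothing reaches a real account.
os.environ["AWS_ACCESS_KEY_ID"] = "testing"
os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

@mock_aws
def test_roundtrip():
    boto3.client("s3").create_bucket(Bucket="test-bucket")
    df = pd.DataFrame({"a": [1, 2, 3]})
    wr.s3.to_parquet(df, "s3://test-bucket/df.parquet")
    out = wr.s3.read_parquet("s3://test-bucket/df.parquet")
    assert out.equals(df)

test_roundtrip()
```

Athena calls are harder to mock this way; for those, putting the SQL execution behind an interface you can stub in tests is a more reliable route.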
0 votes · 0 answers

How do I merge several parquet files into one using awswrangler?

I am trying to use awswrangler.s3.merge_datasets() with a glob source string, but it isn't working for me. https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_datasets.html import glob import awswrangler as…
jtlz2 • 7,700 • 9 • 64 • 114
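Worth noting: `merge_datasets` takes plain S3 prefixes, not glob patterns, and it copies the part files rather than concatenating them. A minimal sketch with hypothetical prefixes, including the read-and-rewrite needed to get literally one file:

```python
import awswrangler as wr

# Copies the objects under source_path to target_path (no glob expansion).
wr.s3.merge_datasets(
    source_path="s3://my-bucket/staging/",  # hypothetical prefix
    target_path="s3://my-bucket/curated/",  # hypothetical prefix
    mode="append",                          # or "overwrite"
)

# To physically combine the parts into a single parquet file instead,
# read the prefix back and rewrite it as one object.
df = wr.s3.read_parquet("s3://my-bucket/curated/", dataset=True)
wr.s3.to_parquet(df, "s3://my-bucket/merged/all.parquet")
```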
0 votes · 0 answers

Unable to query parquet data which has array datatype

When using awswrangler and writing to S3 in parquet format, the data files are not queryable using S3 Select (for CSV) or Athena. For example: events = [{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}] df = pd.DataFrame.from_dict(events) wr.s3.to_parquet( …
Raman • 665 • 1 • 15 • 38
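One plausible route for the Athena side is to register the table in the Glue catalog at write time, so the list column is recorded as `array<bigint>` (S3 Select has no comparable support for nested parquet types). A sketch, with hypothetical database/table names:

```python
import pandas as pd
import awswrangler as wr

events = [{"c1": "12", "c2": [1, 2, 3, 6], "c3": 1234}]
df = pd.DataFrame.from_dict(events)

# dataset=True plus database/table creates or updates the Glue table,
# giving Athena the schema it needs to resolve the array column.
wr.s3.to_parquet(
    df,
    path="s3://my-bucket/events/",  # hypothetical prefix
    dataset=True,
    database="my_database",
    table="events",
)
```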
0 votes · 1 answer

How to specify the location of Athena query results when using awswrangler

The Python code below can fetch data from a pre-configured Athena table when it is run on a local computer. But it automatically creates an S3 bucket to store temporary tables and metadata. The automatically created bucket name looks like…
d.b • 32,245 • 6 • 36 • 77
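The `s3_output` parameter covers this; a minimal sketch with hypothetical names (`ctas_approach=False` also avoids the temporary tables mentioned above):

```python
import awswrangler as wr

# Pointing s3_output at your own bucket stops awswrangler from creating
# its default results/staging bucket.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table LIMIT 10",       # hypothetical query
    database="my_database",
    s3_output="s3://my-results-bucket/athena/",  # pre-existing bucket
    ctas_approach=False,
)
```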
0 votes · 0 answers

How to check which column a value is from when using awswrangler to write a parquet file to S3?

I have some dataframes with various columns and rows, pulled from worksheets in Google Sheets using pygsheets and from Postgres tables in several databases, and I am trying to write these to S3 buckets using awswrangler. For most of them I don't have to…
fmvio • 1 • 2
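A hedged way to pinpoint the offending column is to convert columns one at a time with pyarrow (the engine awswrangler writes parquet with), so the conversion error names a specific column instead of the whole frame:

```python
import pandas as pd
import pyarrow as pa

def find_bad_columns(df: pd.DataFrame) -> list[str]:
    """Try each column separately so a pyarrow conversion error
    can be attributed to a specific column."""
    bad = []
    for col in df.columns:
        try:
            pa.Table.from_pandas(df[[col]])
        except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
            print(f"column {col!r}: {exc}")
            bad.append(col)
    return bad
```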
0 votes · 0 answers

Missing s3 package from AWS Wrangler

I installed the latest version of AWS Wrangler, 2.19.0. When I run the import, this happens: import awswrangler as wr File ~/opt/anaconda3/lib/python3.9/site-packages/awswrangler/lakeformation/_utils.py:13, in 11 from awswrangler import…
0 votes · 1 answer

How can I apply a unique filter to the partition column of a parquet file using wr.s3.read_parquet?

I have a parquet dataset stored in S3, and I want to read it and apply a filter to the partition field, specifically unique. I tried the following, but the unique function cannot be applied. Here's my attempt: query_fecha_dato =…
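A minimal sketch of the usual pattern, with hypothetical names: filter the partitions with `partition_filter`, then call `.unique()` on the resulting column (it is a Series method, not something the reader itself applies):

```python
import awswrangler as wr

# partition_filter prunes partitions before any data is downloaded;
# the lambda receives the partition values as strings.
df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",  # hypothetical prefix
    dataset=True,
    partition_filter=lambda p: p["fecha_dato"] == "2021-01-28",
)

print(df["fecha_dato"].unique())
```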