
I started using AWS Data Wrangler and Athena to upload my data files to S3 and to query them, respectively. My question is about the procedure for "safely" updating the data in the table.

Here is what I did:

  1. I used AWS Data Wrangler,

    import awswrangler as wr

    wr.s3.to_parquet(
        df=my_df,
        path="s3://temp",
        dataset=True,
        partition_cols=["date"],
        concurrent_partitioning=True,
        mode="append",
    )

    to export my DataFrame to S3 as parquet files.

  2. I used AWS Glue to crawl my data files in S3, and create Table A.

  3. Via Athena, I can query my data in Table A.

Now I would like to update my data in Table A. My questions are:

  1. (insert new rows): If I follow what I did above and change the mode to "overwrite", then all my data in S3 is removed (expected, but not desirable!) and the new data files are written to S3. The Glue catalog crawl then yields only the NEW data rows in Table A, removing all my past data. This is not what I want. Is there a safe way to do this?
  2. (update rows): What if I would like to UPDATE data in Table A? Using AWS Data Wrangler, how can I achieve this without adding redundant rows to Table A?

I have tested (2) using mode="append", and indeed I ended up with redundant rows in Table A.

I would appreciate it if you could kindly share your thoughts.

Thanks,

Teodorico Levoff
tafteh

2 Answers


I think you need to use mode="overwrite_partitions". In that mode, only the partitions present in the DataFrame you are writing are deleted and rewritten; all other partitions under the path are left intact, so adding new dates or rewriting a changed date does not wipe the rest of the table.
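
A minimal sketch, reusing the path and partition column from the question (both are assumptions about your setup):

    import awswrangler as wr

    # Only the 'date' partitions present in new_df are deleted and rewritten;
    # every other partition under the path is left untouched.
    wr.s3.to_parquet(
        df=new_df,
        path="s3://temp",
        dataset=True,
        partition_cols=["date"],
        mode="overwrite_partitions",
    )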

sylas
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Sep 06 '22 at 21:49

You can use the API below:

awswrangler.s3.merge_upsert_table

Ref: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_upsert_table.html
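
A hedged sketch of a call (this helper shipped with older 2.x releases of awswrangler; the database, table, and primary-key names below are assumptions, and the table must already exist in the Glue catalog):

    import awswrangler as wr

    # Rows in delta_df whose primary key already exists in the table replace
    # the stored rows; keys not yet present are inserted as new rows.
    wr.s3.merge_upsert_table(
        delta_df=delta_df,        # DataFrame with the new/changed rows
        database="my_database",   # assumed Glue database name
        table="table_a",          # assumed Glue table name
        primary_key=["id"],       # assumed primary-key column(s)
    )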

Tyler2P