How can I apply a unique filter to the partition column of a Parquet dataset using wr.s3.read_parquet?

Question

I have a Parquet dataset stored in S3 and I want to read it and get only the unique values of the partition field. I tried the following, but a unique filter cannot be applied this way.

Here's my attempt:

query_fecha_dato = "{0}fecha_dato={1}/".format(param.delivery["output_path"], fecha_dato_formato)
print(query_fecha_dato)
df_fecha_datos = wr.s3.read_parquet(path=query_fecha_dato,dataset=True,filters=[('fecha_dato','unique',fecha_dato)])
print(df_fecha_datos.head(5))

I expected it to show only the partition column "fecha_dato"; instead it shows the following:

   nro_de_pedido  nro_de_negocio  ...  nrootchex  ingest_date
0     2006968078       635922336  ...         -1   2022-08-06
1     2006968079       635912195  ...         -1   2022-08-06
2     2006968080       635921361  ...         -1   2022-08-06
3     2006968081       635922792  ...         -1   2022-08-06
4     2006968082       635922368  ...         -1   2022-08-06

I want to obtain only the partition column "fecha_dato", without duplicates.

score 0 · Answer 1 · answered Dec 13 '22 at 06:41

Hi and welcome to Stack Overflow. :)

It really helps to have a minimal, reproducible example, so that I can test whether my answer actually works.

I am new to awswrangler, but I cannot find filters as an option in the documentation.

It looks like you need to specify columns=['fecha_dato'] to select only fecha_dato. Furthermore, I don't see a unique option in awswrangler, but you can use pandas' drop_duplicates afterwards:

df_fecha_datos = wr.s3.read_parquet(path=query_fecha_dato, dataset=True, columns=['fecha_dato']).drop_duplicates()

This should work, at least as long as you do not get multiple DataFrames back from S3.

This downloads all values in fecha_dato and only drops the duplicates locally, but I have no good idea how to save this bandwidth without deploying some compute resources in AWS.
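One idea that might avoid the download, assuming the dataset uses Hive-style partition directories (as the fecha_dato=... path in the question suggests): the partition values are then part of the object keys, so listing the keys could be enough. The sketch below only parses key strings; the helper name unique_partition_values and the example keys are made up for illustration. In practice the keys would have to come from an S3 listing call (for example wr.s3.list_objects on the dataset root, which I have not tested against this dataset).

```python
from urllib.parse import unquote


def unique_partition_values(keys, partition_column):
    """Return the sorted distinct values of a Hive-style partition column.

    Looks for path segments of the form ``<partition_column>=<value>``
    in each key and collects the values. Keys without such a segment
    are ignored.
    """
    prefix = partition_column + "="
    values = set()
    for key in keys:
        for segment in key.split("/"):
            if segment.startswith(prefix):
                # Hive-style paths may percent-encode special characters.
                values.add(unquote(segment[len(prefix):]))
                break
    return sorted(values)


# Hypothetical keys, standing in for the result of an S3 listing call.
example_keys = [
    "s3://my-bucket/delivery/fecha_dato=2022-08/part-0.parquet",
    "s3://my-bucket/delivery/fecha_dato=2022-08/part-1.parquet",
    "s3://my-bucket/delivery/fecha_dato=2022-07/part-0.parquet",
    "s3://my-bucket/delivery/_SUCCESS",
]

print(unique_partition_values(example_keys, "fecha_dato"))
```

This only transfers the key listing, not the Parquet data, but it relies on the directory layout rather than on the file contents, so it will not help if fecha_dato is a regular column instead of a partition.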

maow

2,712
1
11
25

Hello, thank you very much! I added the line of code, but I got the following error: [INFO] 2022-12-13T14:14:31.793Z 84d08f6e-cec5-4a83-9e2a-261822db63f1 Se agrego 06.12 s3://entel-datalake-cl-landingzone-dev/itrc_delivery_breighstar/fecha_dato=2022-08/ Empty DataFrame Columns: [] Index: [] [] [ERROR] 2022-12-13T14:14:31.980Z 84d08f6e-cec5-4a83-9e2a-261822db63f1 Error: 'fecha_dato' [INFO] 2022-12-13T14:14:31.980Z 84d08f6e-cec5-4a83-9e2a-261822db63f1 Error – Jeanpiere Alcocer Dec 13 '22 at 14:20
It looks like it cannot find the column. Are you sure the column `fecha_dato` exists? A common gotcha is that the name contains whitespace, e.g. `fecha_dato `. – maow Dec 14 '22 at 06:49
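
A quick way to check for the whitespace gotcha mentioned above is to compare column names after stripping them. This is only a sketch: the helper find_similar_columns and the column list are invented for illustration, and with a real DataFrame you would pass list(df.columns) instead.

```python
def find_similar_columns(columns, wanted):
    """Return the actual column names that equal `wanted` once
    surrounding whitespace and letter case are ignored."""
    target = wanted.strip().lower()
    return [c for c in columns if str(c).strip().lower() == target]


# Hypothetical column names; note the trailing space in "fecha_dato ".
cols = ["nro_de_pedido", "fecha_dato ", "ingest_date"]

print("fecha_dato" in cols)
print(find_similar_columns(cols, "fecha_dato"))
```

If the exact lookup fails but the helper returns a match, the name differs only by whitespace or case. If it returns an empty list, the column is really missing from the DataFrame.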
