
I have a very large amount of data in my S3 bucket, partitioned by two columns, MODULE and DATE, so that the file structure of my parquet files is:

s3://my_bucket/path/file.parquet/MODULE='XYZ'/DATE=2020-01-01

I have 7 MODULE values and the DATE ranges from 2020-01-01 to 2020-09-01. I found a discrepancy in the data and need to correct the MODULE entries for one of the modules. Basically, I need to change all data for a particular index number belonging to MODULE XYZ to MODULE ABC. I can do this in PySpark by loading the data frame and doing something like:

df=df.withColumn('MODULE', when(col('index')==34, "ABC").otherwise(col('MODULE')))

But how do I repartition it so that only those entries that are changed get moved to the ABC MODULE partition? If I do something like:

df.write.mode('append').partitionBy('MODULE', 'DATE').parquet("s3://my_bucket/path/file.parquet")

I would be appending the corrected data alongside the erroneous MODULE data. Plus, I have almost a year's worth of data and don't want to repartition the entire dataset, as it would take a very long time.

Is there a way to do this?

thentangler
  • You can use the `input_file_name` function to identify the files you need to change; that way, you can just overwrite those specific files instead of the whole dataset. That's just an idea. – Steven Sep 17 '20 at 14:55

2 Answers


IIUC, you can do this by filtering the data for that particular index and then saving that data with DATE as the partition.

from pyspark.sql.functions import col, when

df = df.withColumn('MODULE', when(col('index') == 34, "ABC").otherwise(col('MODULE')))
df = df.filter(col('index') == 34)
df.write.mode('overwrite').partitionBy('DATE').parquet("s3://my_bucket/path/ABC/")

This way you will end up modifying only the changed module, i.e. ABC.

Shubham Jain
  • Thank you for this solution. So I would need to delete the data in module `XYZ`, right? Same as with the solution provided by Steven above. Can I do that programmatically? – thentangler Sep 18 '20 at 01:06
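
One possible programmatic route (a sketch of my own, not from either answer): rebuild the MODULE=XYZ partitions without the index-34 rows by writing the filtered data to a staging prefix, then swap it in for the old XYZ data. The staging path below is a made-up placeholder, `xyz_src` follows the partition layout shown in the question, and `spark` is assumed to be an active SparkSession.

# Sketch only: rebuild MODULE=XYZ without the index-34 rows, writing to a staging prefix.
# xyz_src follows the layout from the question; the staging path is hypothetical.
from pyspark.sql import functions as F

xyz_src = "s3://my_bucket/path/file.parquet/MODULE='XYZ'"
staging = "s3://my_bucket/path/XYZ_clean/"

# DATE is still discovered as a partition column when reading below the MODULE directory
xyz_clean = spark.read.parquet(xyz_src).filter(F.col("index") != 34)

xyz_clean.write.mode("overwrite").partitionBy("DATE").parquet(staging)

Once the staging output is verified, the old MODULE='XYZ' prefix can be deleted and replaced with the cleaned data; the separate staging location avoids overwriting a path that Spark is still reading from.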

If I understand correctly, you have data in partition MODULE=XYZ that should be moved to MODULE=ABC.

First, identify the impacted files.

from pyspark.sql import functions as F

# Collect the distinct file paths that contain rows with index 34
file_list = [
    r[0] for r in
    df.where(F.col("index") == 34).select(F.input_file_name()).distinct().collect()
]

Then you create a dataframe based only on these files and use it to populate both MODULE partitions.

# basePath lets Spark recover the MODULE and DATE partition columns from the file paths
df = spark.read.option("basePath", "s3://my_bucket/path/file.parquet").parquet(
    *file_list
).withColumn("MODULE", F.when(F.col("index") == 34, "ABC").otherwise(F.col("MODULE")))

# Append back to the original dataset root so the corrected rows land under MODULE=ABC
df.write.parquet(
    "s3://my_bucket/path/file.parquet", mode="append", partitionBy=["MODULE", "DATE"]
)

At this point, ABC should be OK (you just added the missing data), but XYZ will be wrong because of the duplicate data. To fix XYZ, you just need to delete the files listed in file_list.

Steven
  • Thank you! I did not know about `input_file_name()`. When you say delete the list of files in `file_list` do you mean manually in the S3 console or CLI? Or programmatically? How would I do it programmatically? – thentangler Sep 18 '20 at 01:03
  • 1
    @thentangler for me, this whole script is manual. So, you can use a S3 python lib. Or extract the path list on a file, and create a delete script to paste in S3 console or CLI – Steven Sep 18 '20 at 06:19
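
To illustrate the "S3 Python lib" route Steven mentions, here is a rough boto3 sketch (my own, not from the thread). It assumes `file_list` is a plain list of full s3:// (or s3a://) URIs, as collected above:

# Rough sketch: delete the stale source files collected in file_list with boto3.
import boto3
from urllib.parse import urlparse

s3 = boto3.client("s3")

for uri in file_list:
    parsed = urlparse(uri)                    # split s3://bucket/key into bucket and key
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    s3.delete_object(Bucket=bucket, Key=key)  # remove the old file that still contains index 34

For a large number of files, batching the keys into `delete_objects` calls (up to 1,000 keys per request) would be faster than deleting them one by one.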