
I have a scenario where I have a dataframe which contains, say, three columns, and for each row in that dataframe I need to generate an ID. Unfortunately I can't just use the UUID module, which would make this easy, as the ID has to be 6 characters in length.

I have found a solution here (fixed-length ID) which solves that part.

The issue I am facing, though, is that I don't know how to iterate through the rows in the dataframe to create the new column. I've been trying a for loop, but it results in errors such as the dataframe having no append method, etc.

I'm still fairly new to both Python and PySpark and would appreciate any pointers in the right direction for me to research to help me get moving again as currently I'm not sure how to progress.

Thanks in advance.

user1663003

1 Answer


You could use Pandas UDFs.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Use pandas_udf to define your ID function. A SCALAR pandas UDF receives a
# pandas Series (one value per row) and must return a Series of the same length.
@pandas_udf('string', PandasUDFType.SCALAR)
def generate_uuid(seed: pd.Series) -> pd.Series:
    return seed.apply(lambda s: str(s))  # Do your magic here per value

df = df.withColumn('uuid', generate_uuid(df.column_used_to_generate_id))
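
Since the comments on this answer mention needing over 70 million unique, human-searchable 6-character IDs: one option (a sketch, not part of the original answer) is to base-36 encode a per-row sequential number, e.g. one produced by Spark's monotonically_increasing_id(). 36^6 is about 2.18 billion distinct values, comfortably more than 70 million. A plain-Python encoder could look like this:

```python
import string

# 36 symbols: 0-9 then A-Z
ALPHABET = string.digits + string.ascii_uppercase


def to_base36(n: int, width: int = 6) -> str:
    """Encode a non-negative integer as a fixed-width base-36 string."""
    if not 0 <= n < 36 ** width:
        raise ValueError(f"value does not fit in {width} base-36 characters")
    chars = []
    for _ in range(width):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return ''.join(reversed(chars))
```

You could then wrap to_base36 in the pandas UDF above and feed it the monotonically-increasing ID column; because the input numbers are unique, the encoded 6-character strings are unique too.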
Axel R.
  • Hi. Thanks for this. When I run it, it doesn't return the UUID column, only the columns that were already in the DF; I would have expected withColumn to have added it as a fourth column. I also have an issue in that my workplace doesn't allow Pandas to be used for data engineering work, which isn't something I have control over, unfortunately. I'll still upvote it, as those who can use it will probably find this an answer to their problem, but I still have to resort to the other one, as I need over 70 million unique IDs, which, using numbers and letters, can be handled in 6 characters that are human-usable for searching. – user1663003 Jul 11 '22 at 11:19
  • OK, this does work as expected; I just can't edit my earlier comment to say so. However, I am still in the same scenario as before. – user1663003 Jul 12 '22 at 11:35