New to PySpark and need help with this problem I'm running into
I have a dataframe that contains two columns as shown below:
+------------------+--------------------+
| first_name | last_name |
+------------------+--------------------+
| ["John","Jane"] | ["Smith","Doe"] |
| ["Dwight"] | ["Schrute"] |
| null | null |
+------------------+--------------------+
Basically what I want to do is create a dataframe that would have a schema like this:
root
|-- names: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- first_name: string (nullable = true)
| | |-- last_name: string (nullable = true)
or output like this:
+---------------------------------------+
| names                                 |
+---------------------------------------+
| [{"John", "Smith"}, {"Jane", "Doe"}]  |
| [{"Dwight", "Schrute"}]               |
| [{}]                                  |
+---------------------------------------+
I'm not super concerned about the null case; that shouldn't be too much of an issue. The biggest hurdle I'm facing is combining these two arrays into one organized array of structs. The arrays are not of fixed length; however, both arrays in a given row will always be the same size.
I am assuming I will need to use a UDF, and I have tried many variations of something like:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType

convert_names_udf = F.udf(
    lambda first_name_array, last_name_array: [
        F.struct(F.lit(first_name_array[i]).alias("first_name"),
                 F.lit(last_name_array[i]).alias("last_name"))
        for i in range(len(first_name_array))
    ],
    ArrayType(StructType()))

df = df.withColumn("names", convert_names_udf(F.col("first_name"), F.col("last_name")))

However, that approach does not seem to work and only throws errors. (I'm sure that's not the correct way to use the lit function, but the for i in range(len(first_name_array)) loop lets me iterate over all the first names.)
Any help to even point me in the right direction is much appreciated! Thanks!