
New to PySpark and need help with this problem I'm running into

I have a dataframe that contains two columns as shown below:

+------------------+--------------------+
| first_name       |  last_name         |
+------------------+--------------------+
| ["John","Jane"]  | ["Smith","Doe"]    |
| ["Dwight"]       | ["Schrute"]        |
| null             | null               |
+------------------+--------------------+

Basically what I want to do is create a dataframe that would have a schema like this:

root
 |-- names: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first_name: string (nullable = true)
 |    |    |-- last_name: string (nullable = true)

or output like this:

+-------------------------------------+
| names                               |
+-------------------------------------+
| [{"John", "Smith},{"Jane", "Doe"}]  |
| [{"Dwight", "Schrute"}]             |
| [{}]                                |
+-------------------------------------+

I'm not super concerned about the null case; that shouldn't be too much of an issue. The biggest hurdle I'm facing is combining these two arrays into one array of structs. The arrays are not of fixed length, but both arrays will always be the same size.

I am assuming I will need to use a UDF, and I have tried many variations along the lines of:

import pyspark.sql.functions as F

convert_names_udf = F.udf(lambda first_name_array, last_name_array:
                              [F.struct(F.lit(first_name_array[i]).alias("first_name"),
                                        F.lit(last_name_array[i]).alias("last_name"))
                               for i in range(len(first_name_array))
                              ], ArrayType(StructType()))

df.withColumn("names", convert_names_udf(F.flatten(F.col("first_name"), F.col("last_name"))))

However, that approach does not work and only throws errors. (I'm sure that is not the correct way to use the lit function, but the for i in range(len(first_name_array)) loop lets me iterate over all the first names.)
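
From what I can tell, a UDF is supposed to return plain Python values rather than Column expressions like F.lit and F.struct, with the element schema declared up front, so maybe something along these lines is the right shape? (Just a sketch on my part, untested:)

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

name_schema = ArrayType(StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
]))

# The lambda returns plain tuples; Spark maps them onto name_schema
convert_names_udf = F.udf(
    lambda firsts, lasts: None if firsts is None
    else [(firsts[i], lasts[i]) for i in range(len(firsts))],
    name_schema,
)

df = df.withColumn("names", convert_names_udf("first_name", "last_name"))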

Any help to even point me in the right direction is much appreciated! Thanks!

Nickds54

2 Answers


There is a nice built-in function for exactly this: arrays_zip (available since Spark 2.4).

import pyspark.sql.functions as f

df = df.withColumn('name', f.arrays_zip('first_name', 'last_name'))

df.show(truncate=False)
df.select('name').printSchema()
+------------+------------+----------------------------+
|first_name  |last_name   |name                        |
+------------+------------+----------------------------+
|[John, Jane]|[Smith, Doe]|[{John, Smith}, {Jane, Doe}]|
|[Dwight]    |[Schrute]   |[{Dwight, Schrute}]         |
|null        |null        |null                        |
+------------+------------+----------------------------+

root
 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- first_name: string (nullable = true)
 |    |    |-- last_name: string (nullable = true)
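
If you would rather end up with an empty array than null for the last row, you can wrap the zip in coalesce. A sketch (this relies on coalesce coercing the untyped empty array to the struct array type, which I believe it does):

df = df.withColumn('name', f.coalesce(f.arrays_zip('first_name', 'last_name'), f.array()))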
Lamanus
  • Hi, thank you for your answer! Unfortunately, I don't think this is correct. When I use the arrays_zip function, it just creates a struct column of arrays that looks like this: {first_name: [], last_name: []} – Nickds54 Jul 06 '23 at 14:50
  • This has the schema that you want. – Lamanus Jul 07 '23 at 00:30

I figured it out, posting my solution for anyone who might run into a similar problem.

Basically what I had to do was use posexplode on both columns and then join/collect based on the array position. So my code looks something like:

import pyspark.sql.functions as F

# Creating a row ID so the result can be joined back onto df later
df = df.withColumn("id", F.monotonically_increasing_id())

# Exploded first_name column into [pos, col]
first_name_df = df.select(F.col("id"), F.posexplode(F.col("first_name")))

# Renaming default "col" column to "first_name"
first_name_df = first_name_df.withColumnRenamed("col","first_name")

# Doing same thing as above but for last_name
last_name_df = df.select(F.col("id"), F.posexplode(F.col("last_name")))
last_name_df = last_name_df.withColumnRenamed("col","last_name")

# Joining first_name and last_name on row ID and array position
# (joining on "pos" alone would pair names across different rows)
joined_df = first_name_df.join(last_name_df, on=["id", "pos"])

# Combining first_name and last_name into a struct
joined_df = joined_df.withColumn("name", F.struct(F.col("first_name"), F.col("last_name")))

# Collecting all the structs with the same id into an array of structs
joined_df = joined_df.groupby("id").agg(F.collect_list("name").alias("names"))

# Adding new column back to original dataframe
df = df.join(joined_df, on="id")
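
One caveat: that final join is an inner join, so the row where both arrays are null produces no exploded rows and gets dropped. If you want to keep it (with a null names column), a left join should do it:

df = df.join(joined_df, on="id", how="left")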

Nickds54