
I have a specific problem where I want to derive the value of the bu_id field from the id and matched_id columns.

When there is a value in the matched_id column, bu_id should be the same id for that particular id and for the ids listed in the corresponding matched_id.

When matched_id is blank, bu_id should be the same as id.

input

+---+------------+
|id |matched_id  |
+---+------------+
|0  |7,8         |
|1  |            |
|2  |4           |
|3  |5,9         |
|4  |2           |
|5  |3,9         |
|6  |            |
|7  |0,8         |
|8  |0,7         |
|9  |3,5         |
+---+------------+

output

+---+------------+-----+
|id |matched_id  |bu_id|
+---+------------+-----+
|0  |7,8         |0    |
|1  |            |1    |
|2  |4           |2    |
|3  |5,9         |3    |
|4  |2           |2    |
|5  |3,9         |3    |
|6  |            |6    |
|7  |0,8         |0    |
|8  |0,7         |0    |
|9  |3,5         |3    |
+---+------------+-----+

Can anyone help me with how to approach this problem? Thanks in advance.

SDS
  • To clarify, If the `matched_id` column is null, then `bu_id` should be the same as `id`. If the `matched_id` column is **not** null, we should consider the values listed in both the `id` and `matched_id` columns and `bu_id` should be the minimum of those values. Is that correct? – smurphy Jan 12 '23 at 20:10
  • what is the type of each column in your input df? can you include the output from `df.schema` in your question? thanks! – Derek O Jan 12 '23 at 22:05
  • All columns are `stringType`. – SDS Jan 13 '23 at 07:31

2 Answers


We should try to use functions from the pyspark.sql.functions module wherever possible, because they are optimized for pyspark dataframes (see here), whereas udfs are not and should be avoided when possible.

To achieve the desired output pyspark dataframe, we can concatenate the "id" and "matched_id" columns together, convert that string into a list of strings using split, cast the result as an array of integers, and take the minimum of the array. We do not have to worry about the blank strings, because they get converted into null and F.array_min skips nulls when computing the minimum. This can be done with the following line of code (and while it is a little hard to read, it gets the job done):

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"]
)

df.withColumn(
    "bu_id",
    # minimum over the row's own id plus the ids listed in matched_id
    F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>"))
).show()

Output:

+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
|  0|       7,8|    0|
|  1|          |    1|
|  2|         4|    2|
|  3|       5,9|    3|
|  4|         2|    2|
|  5|       3,9|    3|
|  6|          |    6|
|  7|       0,8|    0|
|  8|       0,7|    0|
|  9|       3,5|    3|
+---+----------+-----+
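
To see why the blank matched_id rows are safe here, below is a quick check (a standalone sketch reusing spark and F from above, not part of the original answer): the trailing empty token produced by the concatenation becomes a null element once cast, and F.array_min skips nulls, so the row's own id is returned.

# "1" concatenated with an empty matched_id gives "1,"; after split and cast
# the array is [1, null], and F.array_min ignores the null element
spark.range(1).select(
    F.array_min(F.split(F.lit("1,"), ",").cast("array<int>")).alias("bu_id")
).show()

# +-----+
# |bu_id|
# +-----+
# |    1|
# +-----+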

Update: in the case of non-numeric strings in the "id" and "matched_id" columns, we can no longer cast to an array of integers. Instead, we can use the pyspark functions F.when and .otherwise (see here) to set the new column to the "id" column when "matched_id" is an empty string "", and apply the longer nested expression when "matched_id" is non-empty.

df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z")
    ],
    ["id", "matched_id"]
)

df2.withColumn(
    "bu_id",
    # when matched_id is non-empty, take the minimum (as strings) of id plus matched_id;
    # otherwise fall back to id
    F.when(
        F.col("matched_id") != "",
        F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ","))
    ).otherwise(F.col("id"))
).show()

Output:

+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
|  0|       7,8|    0|
|  1|          |    1|
|  2|         4|    2|
|  3|       5,9|    3|
|  4|         2|    2|
|  5|       3,9|    3|
|  6|          |    6|
|  7|       0,8|    0|
|  8|       0,7|    0|
|  9|       3,5|    3|
|  x|          |    x|
|  x|       y,z|    x|
+---+----------+-----+
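
As a side note (an assumed standalone sketch, not part of the original answer), F.array_min still works here because the split elements stay strings, and Spark compares strings lexicographically, so "x" is less than "y":

# reusing `spark` and `F` from above: array_min on an array of strings
# returns the lexicographically smallest element
spark.range(1).select(
    F.array_min(F.split(F.lit("x,y,z"), ",")).alias("smallest")
).show()

# +--------+
# |smallest|
# +--------+
# |       x|
# +--------+

One caveat of lexicographic comparison is that multi-digit numeric strings compare as text ("10" sorts before "9"), which does not matter for the single-digit ids in this question.
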
Derek O
  • Thanks @Derek, the solution is correct. One addendum to it: if the `id` or `matched_id` column has proper `string` values like x, y etc., would `array_min` work or do we need to think of some other solution? – SDS Jan 13 '23 at 07:48
  • @SDS I've updated the solution for a situation where you might have non-numeric strings in `id` or `matched_id`. We'll have to avoid trying to cast to an array of integers because that will fail for situations where matched_id might look like `"x","y"`, and we'll instead use `F.when` and `.otherwise`, which allow us to apply one function when matched_id is an empty string `""` and a different function when matched_id is a non-empty string. – Derek O Jan 13 '23 at 08:13
  • And `array_min` still works because strings can be compared. `"x"` is less than `"y"`, for example. Hope this helps! – Derek O Jan 13 '23 at 08:22

To answer this question, I assumed that the logic you are looking to implement is:

  • If the matched_id column is null, then bu_id should be the same as id.
  • If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns and bu_id should be the minimum of those values.

The Set-Up

# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
# making your dataframe
df = spark.createDataFrame(
    [
     ('0','7,8'),
     ('1',''),
     ('2','4'),
     ('3','5,9'),
     ('4','2'),
     ('5','3,9'),
     ('6',''),
     ('7','0,8'),
     ('8','0,7'),
     ('9','3,5'),
    ],
    ['id', 'matched_id'])
print(df.schema.fields)
df.show(truncate=False)

In this df, both the id and matched_id columns are StringType data types. The code that follows builds off this assumption. You can check the column types in your df by running print(df.schema.fields).
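
For reference, with the DataFrame built above, print(df.schema.fields) should report both columns as strings; the exact repr differs slightly between Spark versions, but it will look roughly like this:

[StructField('id', StringType(), True), StructField('matched_id', StringType(), True)]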

+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+

The Logic

To implement the logic for bu_id, we create a function called bu_calculation that defines the logic. Then we wrap the function in a pyspark sql UDF. The bu_id column is then created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF.

# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    
    # turn the string in the matched_id column into a list and remove empty values from the list
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        
        # join id column values with matched_id column values
        all_ids.append(id_int)
        
        # return minimum value
        return min(all_ids)
    
    else:
        # if matched_id column is empty return the id column value
        return id_int

# apply custom bu_calculation function to pyspark sql udf
# IntegerType() declares the UDF's return type, so bu_calculation should return an int
bu_udf = F.udf(bu_calculation, IntegerType())

# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)

+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+

More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

smurphy
  • While this solution works, I feel like it's important to point out that udfs are not optimized for pyspark dataframes. Sometimes we have to resort to udfs because the transformation we want to make to a pyspark dataframe is too complicated to be done with any of the functions from `pyspark.sql.functions`, but I think it's possible to solve this problem without udfs. Still, this is a good effort, has the advantage of being a bit more readable, and could be repurposed for a similar, but more complicated problem! – Derek O Jan 12 '23 at 22:31
  • Thanks a lot @smurphy for the solution. I have a solution in `udf` as well; however, I wanted something in pyspark. – SDS Jan 13 '23 at 07:33