
I want to create a row based on a column.

For example - I have the following data frame.

| lookup_name | alt_name | inventory | location |
|-------------|----------|-----------|----------|
| Honda       | Car      | 1         | au       |
| Apple       | Fruit    | 1         | us       |


I want to convert it to the following

| lookup_name | inventory | location |
|-------------|-----------|----------|
| Honda       | 1         | au       |
| Car         | 1         | au       |
| Apple       | 1         | us       |
| Fruit       | 1         | us       |

Here the alt_name column is removed, and each alternative name becomes its own lookup_name row that keeps the same inventory and location values as the original row.

Lance

1 Answer

from pyspark.sql.functions import array, col, explode

data = [
    ('Honda', 'Car', 1, 'au'),
    ('Apple', 'Fruit', 1, 'us'),
]

df = spark.createDataFrame(data, ['lookup_name', 'alt_name', 'inventory', 'location'])

(
    df
        .withColumn('lookup_name', explode(array(col('lookup_name'), col('alt_name'))))
        .drop('alt_name')
        .show(10, False)
)
# +-----------+---------+--------+
# |lookup_name|inventory|location|
# +-----------+---------+--------+
# |Honda      |1        |au      |
# |Car        |1        |au      |
# |Apple      |1        |us      |
# |Fruit      |1        |us      |
# +-----------+---------+--------+

First, array(col('lookup_name'), col('alt_name')) collects both names into a single array column, e.g. ['Honda', 'Car'] for the first row:

df.withColumn('lookup_name', array(col('lookup_name'), col('alt_name'))).show(10, False)
# +--------------+--------+---------+--------+
# |lookup_name   |alt_name|inventory|location|
# +--------------+--------+---------+--------+
# |[Honda, Car]  |Car     |1        |au      |
# |[Apple, Fruit]|Fruit   |1        |us      |
# +--------------+--------+---------+--------+
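Applying explode to that array column then produces one output row per array element, with every other column duplicated. The semantics can be pictured in plain Python (a sketch of what Spark does, not Spark code; the dict-based rows here are just for illustration):

```python
# Rows after the array() step: the two name columns merged into one list.
rows = [
    {'names': ['Honda', 'Car'], 'inventory': 1, 'location': 'au'},
    {'names': ['Apple', 'Fruit'], 'inventory': 1, 'location': 'us'},
]

# explode: one output row per element of the array column,
# with the remaining columns copied onto each new row.
exploded = [
    {'lookup_name': name, 'inventory': r['inventory'], 'location': r['location']}
    for r in rows
    for name in r['names']
]

for row in exploded:
    print(row)
```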

pyspark.sql.functions.explode(col) returns a new row for each element in the given array or map; the values of all other columns are repeated for each generated row.
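If you happen to be working in pandas rather than Spark, the same reshape can be sketched with a list column and DataFrame.explode (an alternative approach, not part of the PySpark answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'lookup_name': ['Honda', 'Apple'],
    'alt_name': ['Car', 'Fruit'],
    'inventory': [1, 1],
    'location': ['au', 'us'],
})

# Combine both name columns into a single list column...
df['lookup_name'] = df[['lookup_name', 'alt_name']].values.tolist()

# ...then drop alt_name and explode the list into one row per name.
out = df.drop(columns='alt_name').explode('lookup_name').reset_index(drop=True)
print(out)
```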

Linus