
I have a dataframe in (py)Spark where one of the columns is of type 'map'. I want to flatten or split that column into multiple columns, which should be added to the original dataframe. I'm able to unfold the column with flatMap; however, I lose the key needed to join the new dataframe (from the unfolded column) back to the original dataframe.

My schema is like this:

root
 |-- key: string (nullable = true)
 |-- metric: map (nullable = false)
 |    |-- key: string
 |    |-- value: float (valueContainsNull = true)

As you can see, the column 'metric' is a map-field. This is the column that I want to flatten. Before flattening it looks like:

+----+---------------------------------------------------+
|key |metric                                             |
+----+---------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6)|
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2)|
|as3d|Map(metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0)|
+----+---------------------------------------------------+

To convert that field to columns I do

df2.select('metric').rdd.flatMap(lambda x: x).toDF().show()

which gives

+------------------+-----------------+-----------------+
|           metric1|          metric2|          metric3|
+------------------+-----------------+-----------------+
|1.2999999523162842|6.300000190734863|7.599999904632568|
|               1.5|              2.0|2.200000047683716|
| 2.200000047683716|4.300000190734863|              9.0|
+------------------+-----------------+-----------------+
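To see why the key disappears, here is a plain-Python sketch of the flatMap step (an assumed simplification: each Row of `df2.select('metric')` is modeled as a one-element tuple holding just the metric dict, since the select has already dropped the key column):

```python
# Plain-Python model of df2.select('metric').rdd.flatMap(lambda x: x):
# each "Row" below is a one-element tuple holding only the metric dict,
# because select('metric') already dropped the key column.
rows = [({"metric1": 1.3, "metric2": 6.3, "metric3": 7.6},),
        ({"metric1": 1.5, "metric2": 2.0, "metric3": 2.2},)]

# flatMap(lambda x: x) concatenates the tuples, yielding bare dicts
flattened = [item for row in rows for item in row]

print(flattened[0])  # no 'key' field left -- only metric1..metric3
```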

However, I don't see the key, so I don't know how to add this data back to the original dataframe.

What I want is:

+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k|    1.3|    6.3|    7.6|
|d23d|    1.5|    2.0|    2.2|
|as3d|    2.2|    4.3|    9.0|
+----+-------+-------+-------+

My question thus is: how can I get df2 back to df (given that I originally don't know df and only have df2)?

To make df2:

from pyspark.sql.types import StructType, StructField, StringType, FloatType

rdd = sc.parallelize([('123k', 1.3, 6.3, 7.6),
                      ('d23d', 1.5, 2.0, 2.2),
                      ('as3d', 2.2, 4.3, 9.0)])
schema = StructType([StructField('key', StringType(), True),
                     StructField('metric1', FloatType(), True),
                     StructField('metric2', FloatType(), True),
                     StructField('metric3', FloatType(), True)])
df = sqlContext.createDataFrame(rdd, schema)


from pyspark.sql.functions import lit, col, create_map
from itertools import chain

metric = create_map(list(chain(*(
    (lit(name), col(name)) for name in df.columns if "metric" in name
)))).alias("metric")


df2 = df.select("key", metric)
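The `chain` trick above just interleaves each column name with the column itself to produce the flat name/value argument list `create_map` expects. In plain Python (with a string stand-in for the Spark column object, so no Spark is needed) the interleaving looks like this:

```python
from itertools import chain

columns = ["key", "metric1", "metric2", "metric3"]

# Interleave each metric name with a stand-in for its column object,
# mirroring the (lit(name), col(name)) pairing fed to create_map above.
pairs = list(chain(*(
    (name, "col:" + name) for name in columns if "metric" in name
)))

print(pairs)
# ['metric1', 'col:metric1', 'metric2', 'col:metric2', 'metric3', 'col:metric3']
```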
5nv
  • It would be easier to help if you can make a [reproducible example](https://stackoverflow.com/help/mcve). – Psidom Nov 10 '17 at 17:44
  • Yes, I second that; if you give some sample data I will flesh out the answer below. – ags29 Nov 10 '17 at 17:45

3 Answers


I can select a certain key from a map-type column by doing:

df.select('maptypecolumn.key')

In my example I did it as follows:

columns = df2.select('metric').rdd.flatMap(lambda x: x).toDF().columns
for i in columns:
    df2 = df2.withColumn(i, df2.metric[i])
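The loop's effect (one map lookup per discovered column name) can be sketched without Spark; the data below is the example from the question, with dicts standing in for Rows:

```python
# Plain-Python sketch of the withColumn loop: one map lookup per metric name.
rows = [{"key": "123k", "metric": {"metric1": 1.3, "metric2": 6.3, "metric3": 7.6}},
        {"key": "d23d", "metric": {"metric1": 1.5, "metric2": 2.0, "metric3": 2.2}}]

# names discovered from the map keys (the toDF().columns step above)
names = sorted({k for r in rows for k in r["metric"]})

# add one column per name; .get returns None where a key is missing,
# matching Spark's null for absent map keys
flat = [{"key": r["key"], **{n: r["metric"].get(n) for n in names}} for r in rows]

print(flat[0])  # {'key': '123k', 'metric1': 1.3, 'metric2': 6.3, 'metric3': 7.6}
```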
5nv
from pyspark.sql.functions import col, explode

# fetch column names of the original dataframe from keys of MapType 'metric' column
col_names = df2.select(explode("metric")).select("key").distinct().sort("key").rdd.flatMap(lambda x: x).collect()

exprs = [col("key")] + [col("metric").getItem(k).alias(k) for k in col_names]    
df2_to_original_df = df2.select(*exprs)
df2_to_original_df.show()

Output is:

+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k|    1.3|    6.3|    7.6|
|d23d|    1.5|    2.0|    2.2|
|as3d|    2.2|    4.3|    9.0|
+----+-------+-------+-------+
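The column-name discovery can be modeled in plain Python to show what the `explode` / `distinct` / `sort` chain produces (dicts stand in for the MapType column; no Spark needed):

```python
# Plain-Python model of explode("metric") followed by distinct + sort on "key":
rows = [("123k", {"metric1": 1.3, "metric2": 6.3, "metric3": 7.6}),
        ("d23d", {"metric1": 1.5, "metric2": 2.0, "metric3": 2.2})]

# explode turns every map entry into its own (row key, map key, value) record
exploded = [(row_key, k, v) for row_key, m in rows for k, v in m.items()]

# distinct + sort over the exploded map keys yields the new column names
col_names = sorted({k for _, k, _ in exploded})

print(col_names)  # ['metric1', 'metric2', 'metric3']
```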
Prem

You can access the key and value, for example, like this:

from pyspark.sql.functions import explode
df.select(explode("custom_dimensions")).select("key")
ags29
  • The explode gives a row for each value in the column 'custom_dimensions', while I want each value from the map 'custom_dimensions' in a separate column. The key that I'm searching for is the key to join with the rest of the dataframe, while the 'key' in the explode result should become the new column name. – 5nv Nov 10 '17 at 19:45