Dataframe pyspark to dict

Question

I have this dataframe path_df:

path_df.show()
+---------------+-------------+----+
|FromComponentID|ToComponentID|Cost|
+---------------+-------------+----+
|            160|          163|27.0|
|            160|          183|27.0|
|            161|          162|22.0|
|            161|          170|31.0|
|            162|          161|22.0|
|            162|          167|24.0|
|            163|          160|27.0|
|            163|          164|27.0|
|            164|          163|27.0|
|            164|          165|35.0|
|            165|          164|35.0|
|            165|          166|33.0|
|            166|          165|33.0|
|            166|          167|31.0|
|            167|          162|24.0|
|            167|          166|31.0|
|            167|          168|27.0|
|            168|          167|27.0|
|            168|          169|23.0|
|            169|          168|23.0|
+---------------+-------------+----+
only showing top 20 rows

From this, I want to build a dictionary, as follows: {FromComponentID:{ToComponentID:Cost}}

For my current data, it would be:

{160 : {163 : 27,
        183 : 27},
 161 : {162 : 22,
        170 : 31},
 162 : {161 : 22,
        167 : 24},
 ...
 167 : {162 : 24,
        166 : 31,
        168 : 27},
 168 : {167 : 27,
        169 : 23},
 169 : {168 : 23}
}

Can I do that using only PySpark, and if so, how? Or would it be better to extract my data and process it directly in Python?

DavidWayne · Accepted Answer · 2017-12-06T15:33:22.430

You can do all of this with DataFrame transformations and UDFs. The only slightly annoying thing is that you technically have two different types of dictionaries (one where the key is an integer and the value is a dictionary, the other where the key is an integer and the value is a float), so you will have to define two UDFs with different return types. Here is one possible way to do this:

from pyspark.sql.functions import udf,collect_list,create_map
from pyspark.sql.types import MapType,IntegerType,FloatType

data = [[160,163,27.0],[160,183,27.0],[161,162,22.0],
      [161,170,31.0],[162,161,22.0],[162,167,24.0],
      [163,160,27.0],[163,164,27.0],[164,163,27.0],
      [164,165,35.0],[165,164,35.0],[165,166,33.0],
      [166,165,33.0],[166,167,31.0],[167,162,24.0],
      [167,166,31.0],[167,168,27.0],[168,167,27.0],
      [168,169,23.0],[169,168,23.0]]

cols = ['FromComponentID','ToComponentID','Cost']
df = spark.createDataFrame(data,cols)

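# Merges a list of single-entry {ToComponentID: Cost} maps into one map.
combineMap = udf(lambda maps: {key:f[key] for f in maps for key in f},
                 MapType(IntegerType(),FloatType()))

# Same merge one level up: a list of {FromComponentID: {...}} maps into one nested map.
combineDeepMap = udf(lambda maps: {key:f[key] for f in maps for key in f},
                     MapType(IntegerType(),MapType(IntegerType(),FloatType())))

# First agg: one row per FromComponentID holding a list of single-entry maps.
# Second agg: a single row holding the full nested map.
mapdf = df.groupBy('FromComponentID')\
    .agg(collect_list(create_map('ToComponentID','Cost')).alias('maps'))\
    .agg(combineDeepMap(collect_list(create_map('FromComponentID',combineMap('maps')))))

# mapdf has one row and one column, so take the first cell of the first row.
result_dict = mapdf.collect()[0][0]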

For a large dataset, this should perform somewhat better than a solution that first collects all the rows onto a single node. But since Spark still has to serialize the UDF, there won't be huge gains over an RDD-based solution.
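
To see what the two UDFs compute, the lambda they share can be run in plain Python, with no Spark session. This is only a sketch: `merge_maps` is a name introduced here for illustration, and the sample values are taken from the first rows of the question's table.

```python
# Illustrative stand-in for the lambda passed to both combineMap and combineDeepMap.
def merge_maps(maps):
    """Merge a list of dicts into one dict; on a duplicate key the later dict wins."""
    return {key: f[key] for f in maps for key in f}

# Shape of what combineMap receives for one FromComponentID:
# a list of single-entry {ToComponentID: Cost} maps.
inner_160 = merge_maps([{163: 27.0}, {183: 27.0}])
inner_161 = merge_maps([{162: 22.0}, {170: 31.0}])

# Shape of what combineDeepMap receives: one single-entry map per FromComponentID.
nested = merge_maps([{160: inner_160}, {161: inner_161}])

print(nested)
```

Because a later entry overwrites an earlier one, a repeated (FromComponentID, ToComponentID) pair would keep only one of its costs.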

Update:

An RDD solution is a lot more compact but, in my opinion, it is not as clean. This is because PySpark does not easily represent a large dictionary as an RDD. The workaround is to store it as a distributed list of (key, value) tuples and convert that to a dictionary once you collect it to a single node. Here is one possible solution:

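# Group rows by FromComponentID, then build each inner {ToComponentID: Cost} dict.
maprdd = df.rdd.groupBy(lambda x:x[0]).map(lambda x:(x[0],{y[1]:y[2] for y in x[1]}))
# collect() returns a list of (key, dict) tuples; dict() turns it into the nested dictionary.
result_dict = dict(maprdd.collect())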

Again, this should perform better than a pure Python implementation on a single node, and it might not be that different from the DataFrame implementation, but my expectation is that the DataFrame version will be more performant.
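
For comparison, the single-node alternative could look roughly like the sketch below, assuming the rows have already been brought to the driver (for example with `path_df.collect()`). Plain tuples stand in for Spark `Row` objects here, and `rows_to_nested_dict` is a hypothetical helper name, not something from Spark.

```python
from collections import defaultdict

def rows_to_nested_dict(rows):
    """Build {from_id: {to_id: cost}} from an iterable of (from_id, to_id, cost) rows."""
    nested = defaultdict(dict)
    for from_id, to_id, cost in rows:
        nested[from_id][to_id] = cost
    return dict(nested)

# Hand-written tuples standing in for the rows that collect() would return.
rows = [(167, 162, 24.0), (167, 166, 31.0), (167, 168, 27.0), (169, 168, 23.0)]
result_dict = rows_to_nested_dict(rows)
print(result_dict)
```

This only makes sense when the whole table fits comfortably in driver memory.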

I'm interested in an RDD-based solution if you have one. But otherwise, this one works fine. Good job. – Steven, Dec 06 '17 at 09:19
I tried the RDD syntax with the `collectAsMap()` function and it was exactly what I wanted. – A.Ametov, Nov 12 '19 at 12:05

score 2 · Answer 2 · answered Dec 05 '17 at 15:24

The easiest way I know is the following (but it has a pandas dependency):

path_df.toPandas().set_index('FromComponentID').T.to_dict('list')

user8834780

Doesn't work. The output is a list, and it omits duplicated values. – Steven Dec 06 '17 at 09:15
Surprisingly, converting to pandas is at least 3 times faster than using the accepted answer's RDD variant. But it returns a list packed in another list for each key. – A.Ametov Nov 12 '19 at 12:12
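
One possible way to keep the pandas route while getting the nested shape from the question is to group the rows before building the dictionaries. This is a sketch and not part of the original answer; the small frame built by hand here stands in for the result of `path_df.toPandas()`.

```python
import pandas as pd

# Hand-built stand-in for path_df.toPandas(), using the first rows of the question's table.
pdf = pd.DataFrame(
    [(160, 163, 27.0), (160, 183, 27.0), (161, 162, 22.0), (161, 170, 31.0)],
    columns=["FromComponentID", "ToComponentID", "Cost"],
)

# One inner dict per FromComponentID; tolist() yields plain Python ints and floats.
result_dict = {
    int(from_id): dict(zip(group["ToComponentID"].tolist(), group["Cost"].tolist()))
    for from_id, group in pdf.groupby("FromComponentID")
}
print(result_dict)
```

As with any `toPandas()` approach, this assumes the full table fits in memory on the driver.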

score 0 · Answer 3 · answered Dec 05 '17 at 15:10

You can try it this way:

df_prod = spark.read.csv('/path/to/sample.csv',inferSchema=True,header=True)
rdd = df_prod.rdd.map(lambda x: {x['FromComponentID']:{x['ToComponentID']:x['Cost']}})
rdd.collect()

ravee

This doesn't work; you need to use something like `groupByKey` to combine the results into a single dictionary. – DavidWayne Dec 05 '17 at 20:10
The result is a list of n dicts, where n is the number of rows in the dataframe. – Steven Dec 06 '17 at 09:18
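
If that list of single-entry dicts is small enough to handle on the driver, it can still be folded into one nested dictionary after the fact. A minimal sketch, where `collected` is a hand-written stand-in for the output of `rdd.collect()` and `fold_dicts` is a hypothetical helper; note that this does the combining in plain Python rather than on the cluster, unlike the `groupByKey` suggestion.

```python
def fold_dicts(dicts):
    """Fold [{from_id: {to_id: cost}}, ...] into one {from_id: {to_id: cost, ...}}."""
    merged = {}
    for d in dicts:
        for from_id, inner in d.items():
            # Create the inner dict on first sight of from_id, then add its entries.
            merged.setdefault(from_id, {}).update(inner)
    return merged

# Stand-in for the list of n single-entry dicts described in the comments.
collected = [{160: {163: 27.0}}, {160: {183: 27.0}}, {161: {162: 22.0}}]
print(fold_dicts(collected))
```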
