
I'd like to build a dictionary from two columns of a DataFrame: the values of the first column should become the dictionary's keys, and the values of the second column should become the list of values for each key.

Example:

keys vals
203 4
203 3
203 6
412 33
412 123

I want to transform such a DataFrame into:

final_dict = {
   "203": [4, 3, 6],
   "412": [33, 123]
}

Is there a fast method that avoids loops, or are loops necessary here?

dawid2312
  • Does this answer your question? [dataframe to dict such that one column is the key and the other is the value](https://stackoverflow.com/questions/53941224/dataframe-to-dict-such-that-one-column-is-the-key-and-the-other-is-the-value) – RandomGuy Dec 21 '22 at 16:52
  • What have you tried so far? – scr Dec 21 '22 at 16:57
  • No, that answer is for a pandas DataFrame, not PySpark – dawid2312 Dec 22 '22 at 09:33

1 Answer


One way to do it is to use the function collect_list, which gathers all the values from a group into a list (use collect_set if you want distinct values instead):

import pyspark.sql.functions as F

# group by "keys" and gather every "vals" entry of each group into a list,
# then bring the grouped rows back to the driver
lst = df.groupby('keys').agg(F.collect_list('vals').alias('vals')).collect()

# each collected Row is (key, list_of_vals)
print({str(i[0]): i[1] for i in lst})
# {'412': [33, 123], '203': [4, 6, 3]}

Note that the .collect() call can take time on a large DataFrame, since it brings all the grouped rows to the driver.
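For intuition (and to sanity-check the result without a Spark session), here is a minimal plain-Python sketch of the same grouping using collections.defaultdict on the example data. This loops on the driver, so it only illustrates what collect_list does per group; it is not a substitute for the distributed aggregation:

```python
from collections import defaultdict

# Example rows as (key, value) pairs, mirroring the DataFrame's two columns
rows = [(203, 4), (203, 3), (203, 6), (412, 33), (412, 123)]

grouped = defaultdict(list)
for k, v in rows:
    grouped[k].append(v)  # same per-key accumulation collect_list performs

final_dict = {str(k): v for k, v in grouped.items()}
print(final_dict)
# {'203': [4, 3, 6], '412': [33, 123]}
```

Unlike the Spark version, this preserves the input row order within each key; collect_list gives no ordering guarantee across partitions.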

Ric S