
I'd like to build a dictionary from two columns of a DataFrame: the values of the first column should become the dictionary's keys, and the values of the second column should become the list of values for each key.

Example:

keys vals
203 4
203 3
203 6
412 33
412 123

I want to transform such a DataFrame into:

final_dict = {
   "203": [4, 3, 6],
   "412": [33, 123]
}

Is there a fast method that avoids loops, or are loops necessary here?

dawid2312
  • Does this answer your question? [dataframe to dict such that one column is the key and the other is the value](https://stackoverflow.com/questions/53941224/dataframe-to-dict-such-that-one-column-is-the-key-and-the-other-is-the-value) – RandomGuy Dec 21 '22 at 16:52
  • What have you tried so far? – scr Dec 21 '22 at 16:57
  • No, that answer is for a pandas DataFrame, not PySpark – dawid2312 Dec 22 '22 at 09:33

1 Answer


One way to do it is to use the function collect_list, which gathers all the values from a group into a list (use collect_set if you want distinct values instead):

import pyspark.sql.functions as F

# group by "keys" and gather every "vals" entry of each group into a list,
# then bring the grouped rows back to the driver
lst = df.groupby('keys').agg(F.collect_list('vals').alias('vals')).collect()

# each collected Row is (key, list_of_vals)
print({str(i[0]): i[1] for i in lst})
# {'412': [33, 123], '203': [4, 6, 3]}

Note that the .collect() call can take time on a large DataFrame, since it brings all the grouped rows to the driver.
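For intuition (and to sanity-check the result without a Spark session), here is a minimal plain-Python sketch of the same grouping using collections.defaultdict on the example data. This loops on the driver, so it only illustrates what collect_list does per group; it is not a substitute for the distributed aggregation:

```python
from collections import defaultdict

# Example rows as (key, value) pairs, mirroring the DataFrame's two columns
rows = [(203, 4), (203, 3), (203, 6), (412, 33), (412, 123)]

grouped = defaultdict(list)
for k, v in rows:
    grouped[k].append(v)  # same per-key accumulation collect_list performs

final_dict = {str(k): v for k, v in grouped.items()}
print(final_dict)
# {'203': [4, 3, 6], '412': [33, 123]}
```

Unlike the Spark version, this preserves the input row order within each key; collect_list gives no ordering guarantee across partitions.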

Ric S