15

I have a DataFrame (df) in PySpark, created by reading from a Hive table:

df = spark.sql('select * from <table_name>')


+++++++++++++++++++++++++++++++++++++++++++
|  Name    |    URL visited               |
+++++++++++++++++++++++++++++++++++++++++++
|  person1 | [google,msn,yahoo]           |
|  person2 | [fb.com,airbnb,wired.com]    |
|  person3 | [fb.com,google.com]          |
+++++++++++++++++++++++++++++++++++++++++++

When I tried the following, I got an error:

df_dict = dict(zip(df['name'], df['url']))
"TypeError: zip argument #1 must support iteration."

type(df.name) is pyspark.sql.column.Column

How do I create a dictionary like the following, which can be iterated over later?

{'person1': ['google', 'msn', 'yahoo']}
{'person2': ['fb.com', 'airbnb', 'wired.com']}
{'person3': ['fb.com', 'google.com']}

Appreciate your thoughts and help.

Danny Varod
user8946942

4 Answers

24

I think you can try row.asDict(). This code runs directly on the executors, so you don't have to collect the data onto the driver.

Something like:

df.rdd.map(lambda row: row.asDict())
Mohamed Ali JAMAOUI
Cosmin
  • Note this will produce rows of the form: `{"Name": "person1", "URL Visited": ["google","msn","yahoo"] }`, which is not the output that the OP asked for, though it's quite easy to change the `map` function to address this. – pault Jun 10 '19 at 18:58
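A minimal sketch of that adjustment (assuming the column names `Name` and `URL visited` from the question; the result stays distributed until you collect it):

df.rdd.map(lambda row: {row['Name']: row['URL visited']}).collect()
#[{'person1': ['google', 'msn', 'yahoo']},
# {'person2': ['fb.com', 'airbnb', 'wired.com']},
# {'person3': ['fb.com', 'google.com']}]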
17

How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.

df_list_of_dict = [row.asDict() for row in df.collect()]

type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)

df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
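If the {name: urls} shape from the question is what's wanted, a small reshaping step on top of this list works (a sketch, assuming the column names Name and URL visited):

df_dict = {d['Name']: d['URL visited'] for d in df_list_of_dict}
#{'person1': ['google', 'msn', 'yahoo'],
# 'person2': ['fb.com', 'airbnb', 'wired.com'],
# 'person3': ['fb.com', 'google.com']}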
user9074332
6

If you wanted your results in a Python dictionary, you could use collect()1 to bring the data into local memory and then massage the output as desired.

First collect the data:

df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]

This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:

df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]

1 Be advised that for large data sets, this operation can be slow and can potentially fail with an out-of-memory error. Consider first whether this is really what you want to do, as you lose the parallelization benefits of Spark by bringing the data into local memory.
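If driver memory is a concern but the data is still needed locally, one alternative worth noting (a sketch, not part of the original answer) is DataFrame.toLocalIterator(), which streams one partition at a time instead of materializing every Row at once:

df_dict = {row['Name']: row['URL visited'] for row in df.toLocalIterator()}
# The final dict still holds all the data; toLocalIterator() only avoids
# building the full intermediate list of Row objects on the driver.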

pault
2

Given:

+++++++++++++++++++++++++++++++++++++++++++
|  Name    |    URL visited               |
+++++++++++++++++++++++++++++++++++++++++++
|  person1 | [google,msn,yahoo]           |
|  person2 | [fb.com,airbnb,wired.com]    |
|  person3 | [fb.com,google.com]          |
+++++++++++++++++++++++++++++++++++++++++++

This should work:

df_dict = df \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()

df_dict

#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]

This way you just collect after processing.
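Since the question mentions iterating over the result later, a quick sketch of what that could look like with the df_dict list built above:

for entry in df_dict:
    for name, urls in entry.items():
        print(name, urls)
# person1 ['google', 'msn', 'yahoo']
# person2 ['fb.com', 'airbnb', 'wired.com']
# person3 ['fb.com', 'google.com']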

Please let me know if that works for you :)

sneaky_lobster