0

I'm using SnappyData with PySpark to run my SQL queries and convert the output DataFrame into a dictionary so I can bulk insert it into MongoDB. I've gone through many similar questions to test the conversion of a Spark DataFrame to a dictionary.

Currently I'm using map(lambda row: row.asDict(), x.collect()) to convert my bulk DataFrame to a list of dictionaries, and it takes 2-3 seconds for 10K records.

I've stated below how I implement my idea:

x = snappySession.sql("select * from test")
df = map(lambda row: row.asDict(), x.collect())
db.collection.insert_many(df)

Is there any faster way?

techie95

2 Answers

0

I would look into whether you can write directly to MongoDB from Spark, as that will be the fastest approach.
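For reference, a direct write with the MongoDB Spark Connector might look roughly like the sketch below (assuming the connector package is available; the URI, database, and collection names are placeholders, and credentials can usually be embedded in the connection URI):

x = snappySession.sql("select * from test")

(x.write
    .format("com.mongodb.spark.sql.DefaultSource")                # MongoDB Spark Connector data source
    .mode("append")
    .option("uri", "mongodb://user:password@host:27017/testdb")   # placeholder URI; authentication goes here
    .option("database", "testdb")                                 # placeholder database name
    .option("collection", "test")                                 # placeholder collection name
    .save())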

Failing that, you can use this method:

x = snappySession.sql("select * from test")
dictionary_rdd = x.rdd.map(lambda row: row.asDict())

for d in dictionary_rdd.toLocalIterator():
    db.collection.insert_one(d)

This will create all the dictionaries in Spark in a distributed manner. The rows are then returned to the driver and inserted into Mongo one at a time, so that you don't run out of memory on the driver.
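If inserting one row at a time is too slow, a middle ground is to pull rows through the local iterator in batches; a minimal sketch (assuming pymongo and an arbitrary batch size of 1000) could look like this:

from itertools import islice

def batches(iterator, size=1000):
    # Yield lists of up to `size` dictionaries from the local iterator.
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

for batch in batches(dictionary_rdd.toLocalIterator()):
    db.collection.insert_many(batch)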

Anake
  • I'm aware of sending the DF directly to MongoDB. In the given [documentation](https://docs.mongodb.com/spark-connector/master/python-api/) there is no DB authentication, which is why I've opted for this approach. – techie95 Dec 07 '17 at 11:15
  • Thank you @Anake, but it's taking nearly 12-15 sec. Are there any other ways you'd suggest? – techie95 Dec 08 '17 at 05:29
0

I'd recommend using foreachPartition:

(snappySession
    .sql("select * from test")
    .foreachPartition(insert_to_mongo))

where insert_to_mongo:

def insert_to_mongo(rows):
    client = ...
    db = ...
    docs = [row.asDict() for row in rows]
    if docs:  # skip empty partitions; insert_many fails on an empty batch
        db.collection.insert_many(docs)

Alper t. Turker
  • Did you check or run the code? It is giving me the error `AttributeError: 'itertools.chain' object has no attribute 'asDict'` – techie95 Dec 08 '17 at 05:20