0

I'm using SnappyData with PySpark to run my SQL queries and convert the output DataFrame into a dictionary so I can bulk insert it into MongoDB. I've gone through many similar questions to test the conversion of a Spark DataFrame to a dictionary.

Currently I'm using map(lambda row: row.asDict(), x.collect()) to convert my bulk DataFrame to a list of dictionaries, and it takes 2-3 seconds for 10K records.

I've stated below how I implement my idea:

x = snappySession.sql("select * from test")
df = map(lambda row: row.asDict(), x.collect())
db.collection.insert_many(df)

Is there any faster way?

techie95

2 Answers

0

I would look into whether you can write directly to MongoDB from Spark, as that will be the fastest approach.
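For reference, a direct write with the MongoDB Spark Connector might look roughly like the sketch below (assuming the connector package is available; the URI, database, and collection names are placeholders, and credentials can usually be embedded in the connection URI):

x = snappySession.sql("select * from test")

(x.write
    .format("com.mongodb.spark.sql.DefaultSource")                # MongoDB Spark Connector data source
    .mode("append")
    .option("uri", "mongodb://user:password@host:27017/testdb")   # placeholder URI; authentication goes here
    .option("database", "testdb")                                 # placeholder database name
    .option("collection", "test")                                 # placeholder collection name
    .save())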

Failing that, you can use this method:

x = snappySession.sql("select * from test")
dictionary_rdd = x.rdd.map(lambda row: row.asDict())

for d in dictionary_rdd.toLocalIterator():
    db.collection.insert_one(d)

This will create all the dictionaries in Spark in a distributed manner. The rows are then returned to the driver and inserted into Mongo one at a time, so that you don't run out of memory on the driver.
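If inserting one row at a time is too slow, a middle ground is to pull rows through the local iterator in batches; a minimal sketch (assuming pymongo and an arbitrary batch size of 1000) could look like this:

from itertools import islice

def batches(iterator, size=1000):
    # Yield lists of up to `size` dictionaries from the local iterator.
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch

for batch in batches(dictionary_rdd.toLocalIterator()):
    db.collection.insert_many(batch)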

Anake
  • I'm aware of sending the DF directly to MongoDB. In the given [documentation](https://docs.mongodb.com/spark-connector/master/python-api/) there is no DB authentication, which is why I've opted for this approach. – techie95 Dec 07 '17 at 11:15
  • Thank you @Anake, but it's taking nearly 12-15 sec. Are there any other ways you'd suggest? – techie95 Dec 08 '17 at 05:29
0

I'd recommend using foreachPartition:

(snappySession
    .sql("select * from test")
    .foreachPartition(insert_to_mongo))

where insert_to_mongo:

def insert_to_mongo(rows):
    client = ...
    db = ...
    docs = [row.asDict() for row in rows]
    if docs:  # skip empty partitions; insert_many fails on an empty batch
        db.collection.insert_many(docs)

Alper t. Turker
  • Did you check or run the code? It is giving me the error `AttributeError: 'itertools.chain' object has no attribute 'asDict'` – techie95 Dec 08 '17 at 05:20