20

I have read somewhere that you can store Python objects (more specifically dictionaries) as binaries in MongoDB by using BSON. However, right now I cannot find any documentation related to this.

Would anyone know how exactly this can be done?

chiffa
  • 1
    It's not at all clear what you're trying to do, what you've tried and what didn't work. Please edit the question to include those helpful details. :) – WiredPrairie Aug 06 '13 at 20:48
  • 2
    If you're doing that for performance, [this benchmark](http://kovshenin.com/2010/pickle-vs-json-which-is-faster/) might surprise you. – georg Aug 06 '13 at 21:15
  • @thg435: Thanks for the link, I will keep it in mind for a project where I/O would be more critical for the performance of my project! – chiffa Aug 06 '13 at 22:46
  • @thg435: the major problem for me is that I rely heavily on serialization of numpy data types, which is not supported by Python's json module – chiffa Feb 23 '14 at 19:31
  • As a side note, using Pickle (as suggested in the answers) can have some issues: http://pyvideo.org/video/2566/pickles-are-for-delis-not-software. In summary - problems with security + maintainability of your code. – tushar747 Jul 02 '15 at 08:53

3 Answers

42

There isn't a way to store an object in a file (database) without serializing it. If the data needs to move from one process to another process or to another server, it will need to be serialized in some form to be transmitted. Since you're asking about MongoDB, the data will absolutely be serialized in some form in order to be stored in the MongoDB database. When using MongoDB, it's BSON.

If you're actually asking whether there is a way to store a more raw form of a Python object in a MongoDB document, you can insert a Binary field into a document, and it can contain any data you'd like. It's not directly queryable in that form, so you're potentially losing a lot of the benefits of using a NoSQL document database like MongoDB.

>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client['test-database']
>>> coll = db.test_collection
>>> # the collection is ready now
>>> from bson.binary import Binary
>>> import pickle
>>> # create a sample object
>>> myObj = {}
>>> myObj['demo'] = 'Some demo data'
>>> # convert it to the raw bytes
>>> thebytes = pickle.dumps(myObj)
>>> # insert_one replaces the legacy insert(), which PyMongo 4 removed
>>> coll.insert_one({'bin-data': Binary(thebytes)})
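Reading the data back is the reverse of the steps above: fetch the document, take the binary field, and pass it to pickle.loads. The following is a minimal sketch of that round trip, not part of the original answer. To keep it runnable without a server, a plain dict named doc (a hypothetical stand-in) plays the role of the document that coll.find_one(...) would return; the 'bin-data' field name is taken from the example above.

```python
import pickle

# Same sample object as in the example above.
my_obj = {'demo': 'Some demo data'}

# Serialize to bytes; this is the value that gets wrapped in Binary(...)
# before insertion.
the_bytes = pickle.dumps(my_obj)

# Hypothetical stand-in for the stored document. With a live server this
# would come from coll.find_one(...), and the field would hold a
# bytes-like value.
doc = {'bin-data': the_bytes}

# Deserialize: pickle.loads accepts the bytes-like field directly.
restored = pickle.loads(doc['bin-data'])
assert restored == my_obj
```

No extra preprocessing should be needed beyond pickle.loads. Only unpickle data from a source you trust, since loading a pickle can execute arbitrary code.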
WiredPrairie
  • Thanks for the extensive answer! After all I think I will stick with `pickle` serialization, to build a JSON object. It outputs an identical string for sets containing the same strings, which is critical for me. In addition, my I/O to the database isn't the most performance-critical part of my code. – chiffa Aug 06 '13 at 22:45
  • 1
    There is a typo in the example code: it should read pickle.dumps(myObj) on the before-the-last line – Christophe Mar 04 '15 at 12:42
  • 1
    Thanks , pickle.dumps(obj) worked for me (http://scikit-learn.org/stable/modules/model_persistence.html#persistence-example) – Spl2nky Jul 05 '16 at 17:48
  • I guess the answer should be changed, as pickle is changed now and it should be pickle.dumps(obj) and not pickle.dump(obj) – Kishan Mehta Dec 21 '16 at 18:28
  • 3
    so what preprocessing is needed if we want to read the data again from Mongo? – Luk Aron Jul 15 '20 at 08:25
5

Assuming you are not specifically interested in MongoDB, you are probably not looking for BSON. BSON is just a different serialization format compared to JSON, designed for more speed and space efficiency. On the other hand, pickle does more of a direct encoding of Python objects.

However, do your speed tests before you adopt pickle to ensure it is better for your use case.
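A speed test along those lines can be sketched with the standard library's timeit module. This is only an illustrative harness, not a benchmark result: the sample payload is hypothetical, and JSON stands in for the comparison format because it ships with Python. The same harness could time BSON encoding if PyMongo's bson package is installed.

```python
import json
import pickle
import timeit

# Hypothetical sample payload; substitute data shaped like your own.
sample = {'name': 'demo', 'values': list(range(1000)), 'tags': ['a', 'b', 'c']}


def pickle_round_trip(obj):
    """Serialize and deserialize with pickle."""
    return pickle.loads(pickle.dumps(obj))


def json_round_trip(obj):
    """Serialize and deserialize with json."""
    return json.loads(json.dumps(obj))


runs = 200
pickle_seconds = timeit.timeit(lambda: pickle_round_trip(sample), number=runs)
json_seconds = timeit.timeit(lambda: json_round_trip(sample), number=runs)

print(f'pickle: {pickle_seconds:.4f}s for {runs} round trips')
print(f'json:   {json_seconds:.4f}s for {runs} round trips')
```

The timings depend heavily on the shape and size of the data, so it is worth running this against a payload that resembles what you will actually store.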

superdud
0

It seems you would still need to serialize with the pickle module, which produces bytes; deserializing those bytes with pickle gives back the Python object directly.

You can also store the pickled bytes directly in Mongo.

import pickle as pkl
from uuid import uuid4

from pymongo import MongoClient

data = dict(key='mongo')
pickled_data = pkl.dumps(data)
# store the UUID as a string: PyMongo 4 refuses to encode a native
# uuid.UUID unless uuidRepresentation is configured on the client
uid = str(uuid4())

client = MongoClient() # add DB url in the constructor if needed
db = client.test

# insertion
db.data.insert_one({
    'uuid': uid,
    'data': pickled_data
})

# retrieval
result = db.data.find_one({'uuid': uid})
assert pkl.loads(result['data']) == data