
I previously saved a dictionary that maps image_name -> list of feature vectors; the file is ~32 GB. I have been using cPickle to load the dictionary, but since I only have 8 GB of RAM, this process takes forever. Someone suggested storing all the info in a database and reading from that, but would that actually be a faster/better solution than reading the file from disk? Why?

user1835351
  • Really _REALLY_ depends on what type of data, and how much. You might also need to think about storing only principal components and recomputing the rest on the fly, if that is computationally feasible – inspectorG4dget Aug 04 '14 at 16:08
  • I'm storing SIFT descriptors, which take forever to build, so I really can't compute them every time. They are 1 x 128 vectors: http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html – user1835351 Aug 04 '14 at 16:10
  • You really need to consider whether you need all of that data to be available at the same time. If you have a lot of images, you probably only need one at a time and you are probably not going to do anything with the others. I am pretty sure that selecting a few images one by one from a database will be much faster than loading 32GB of data at once ;) – BrtH Aug 04 '14 at 16:58
  • I do need it all at the same time, since I am running k-means clustering on the entire list of descriptors, so I don't think there's a way around having them in main memory – user1835351 Aug 04 '14 at 17:01
  • If you truly are doing calculations that require 32 GB of data to be in memory, then a database won't help you. I would rethink your algorithm or get more RAM. Also see http://stackoverflow.com/questions/6372397/k-means-with-really-large-matrix (a streaming sketch follows these comments). – derricw Aug 04 '14 at 18:12
  • You say "this process takes forever", but are you actually able to load the full 32 GB? If you can, then I'd say that the data size is much higher on disk, or you're using swap space; most likely a combination of both. – dsemi Aug 13 '14 at 14:11
  • Could you be more specific than "forever"? How long is it really taking? It may be that there's very little that can be improved (unless you can get away with reading less data) because your bottleneck is I/O. – dsemi Aug 13 '14 at 14:50
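The comments above pin the real constraint on running k-means over every descriptor at once. One way around that (in the spirit of the linked question) is scikit-learn's MiniBatchKMeans, which can be fed descriptors in chunks via partial_fit so the full 32 GB never has to sit in RAM at the same time. A minimal sketch, assuming the descriptors have been dumped to one .npy file per image; that file layout, the vocabulary size, and the batch size are all illustrative, not anything from the question:

```python
# Sketch only: streams SIFT descriptors from per-image .npy dumps
# (hypothetical layout) into MiniBatchKMeans instead of loading 32 GB at once.
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

N_CLUSTERS = 1000                                # illustrative vocabulary size
kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=10000)

buffer, buffered = [], 0
for path in glob.glob("descriptors/*.npy"):      # one (n_keypoints, 128) array per image
    arr = np.load(path)
    buffer.append(arr)
    buffered += arr.shape[0]
    if buffered >= 10000:                        # feed reasonably large batches
        kmeans.partial_fit(np.vstack(buffer))
        buffer, buffered = [], 0

if buffered >= N_CLUSTERS:                       # flush the leftover batch if it is big enough
    kmeans.partial_fit(np.vstack(buffer))

centers = kmeans.cluster_centers_                # (N_CLUSTERS, 128) visual vocabulary
```

The same streaming idea works if the chunks come from a database cursor instead of .npy files.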

1 Answer


Use a database, because it lets you query for just the records you need instead of loading everything into memory at once; I've done this before. I would advise against using cPickle for data this size. What specific implementation are you using?
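To make that concrete, here is a minimal sketch using the standard-library sqlite3 module. The file name, table layout, and the float32/128-column assumption (matching the SIFT descriptors mentioned in the comments) are all illustrative, not something from the question:

```python
# Sketch only: store each image's descriptors as a BLOB keyed by image name,
# so one image can be fetched without unpickling the whole 32 GB dictionary.
import sqlite3
import numpy as np

conn = sqlite3.connect("features.db")
conn.execute("CREATE TABLE IF NOT EXISTS features (image TEXT PRIMARY KEY, desc BLOB)")

def store(image_name, descriptors):
    # descriptors: (n_keypoints, 128) SIFT array, flattened to raw float32 bytes.
    arr = np.asarray(descriptors, dtype=np.float32)
    conn.execute("INSERT OR REPLACE INTO features VALUES (?, ?)",
                 (image_name, arr.tobytes()))
    conn.commit()

def load(image_name):
    # Pull back only the one image you need.
    row = conn.execute("SELECT desc FROM features WHERE image = ?",
                       (image_name,)).fetchone()
    if row is None:
        return None
    return np.frombuffer(row[0], dtype=np.float32).reshape(-1, 128)
```

The speedup over a single pickle file comes from the indexed key: you read only the rows you ask for, rather than deserializing the entire dictionary up front.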

  • An in-memory dictionary (i.e. a hashmap) is usually faster than a DB, as long as it actually fits in RAM and is not in swap. If you really need 32 GB, I would use a DB, not for speed but because of memory. – linluk Aug 13 '14 at 14:22
  • Use MongoDB; it's much more flexible than pickle (quick sketch below). – Nickpick Jun 28 '21 at 13:27
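For completeness, a minimal sketch of the MongoDB suggestion above, assuming pymongo is installed and a local MongoDB instance is running; the database, collection, and field names are illustrative:

```python
# Sketch only: one document per image, descriptors stored as a binary blob.
import pickle
import numpy as np
from pymongo import MongoClient
from bson.binary import Binary

coll = MongoClient("localhost", 27017)["features_db"]["descriptors"]  # hypothetical names

def store(image_name, descriptors):
    blob = Binary(pickle.dumps(np.asarray(descriptors, dtype=np.float32)))
    coll.replace_one({"_id": image_name}, {"_id": image_name, "desc": blob}, upsert=True)

def load(image_name):
    doc = coll.find_one({"_id": image_name})
    return pickle.loads(doc["desc"]) if doc else None
```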