
I have a large dataset (11 million rows) that I have loaded into pandas. I then want to build a spatial index, such as an r-tree or quad tree, but when I build the index in memory it consumes a ton of RAM on top of what is already used by the large file.

To reduce the memory footprint, I was thinking of pushing the index to disk. Can you store the tree in a table? Or even in a dataframe stored as an HDF table? Is there a better strategy?
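
Roughly what I'm doing now (simplified; the file name, column names, and the rtree package are just placeholders for this sketch):

    import pandas as pd
    from rtree import index  # rtree package (libspatialindex bindings), used here only as an example

    # Load the full dataset into memory (this alone is already large).
    df = pd.read_csv('points.csv')  # ~11 million rows, placeholder file name

    # Build an in-memory r-tree on point coordinates; this adds a second
    # large structure on top of the DataFrame.
    idx = index.Index()
    for i, (x, y) in enumerate(zip(df['x'], df['y'])):
        idx.insert(i, (x, y, x, y))  # a point is a degenerate bounding box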

Thanks

JabberJabber
  • This question is a bit off-topic. I am fairly certain mysql can handle storing and retrieving trees. – James Apr 15 '17 at 02:30
  • I am not sure what this question means, but what about reading the dataset in batches in pandas? (See the sketch after these comments.) – Peaceful Apr 15 '17 at 04:03
  • @peaceful I'm trying to ask: if I have a really large dataset and I don't want to put an r-tree index into memory, is there a strategy for doing this, or an existing package? – JabberJabber Apr 15 '17 at 10:56
  • OpenStreetMap has a number of tools for dealing with spatial data; check out the wiki (http://wiki.openstreetmap.org/wiki/Downloading_data), which links to various tools (Osmosis, osmconvert, osmfilter, ...). – TilmannZ Apr 16 '17 at 10:26
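
A minimal sketch of the batched-reading suggestion above, assuming a CSV source; the chunk size and the per-chunk handler are placeholder assumptions:

    import pandas as pd

    # Process the file in chunks instead of loading all 11M rows at once.
    # 'points.csv' and the chunk size are placeholders.
    for chunk in pd.read_csv('points.csv', chunksize=500_000):
        # Work on each chunk here, e.g. filter rows or feed them to an index.
        process(chunk)  # hypothetical per-chunk handler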

1 Answer


Yes, r-trees can be stored on disk easily. (It's much harder with KD-trees and quad-trees.)

That is why the index is block-oriented: the block size is meant to be chosen to match your drive.

I don't use pandas, so I won't give a library recommendation.
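
Purely as an illustration of a disk-backed, block-oriented r-tree (not an endorsement of a particular library), here is a minimal sketch assuming the Python rtree package (libspatialindex bindings); the file name and page size are placeholder assumptions:

    from rtree import index

    # Tune the on-disk layout; pagesize should roughly match the drive's
    # block size (4096 here is a placeholder value).
    p = index.Property()
    p.pagesize = 4096

    # Passing a base name creates a file-backed index ('spatial_idx.dat'
    # and 'spatial_idx.idx') instead of holding the whole tree in RAM.
    idx = index.Index('spatial_idx', properties=p)

    # Insert points as degenerate bounding boxes (x, y, x, y).
    idx.insert(0, (1.0, 2.0, 1.0, 2.0))

    # Query: ids of entries whose boxes intersect the window.
    hits = list(idx.intersection((0.0, 0.0, 5.0, 5.0)))

    idx.close()  # flush pages to disk

Only the pages touched by an insert or query need to be read into memory, which is the point of matching the page size to the drive.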

Has QUIT--Anony-Mousse