I am trying to parse multiple large .txt files (actually .xyz files) that contain xyz coordinates. Each file has millions of lines, each representing one coordinate (x, y, z) separated by commas. I only want to keep the coordinates that fall inside a specific bounding box. Once found, I want to keep those coordinates around for multiple more specific lookups, so I wanted to store them spatially indexed in a quadtree.
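For context, each line looks roughly like this (the values are made up), and I parse it into three floats:

    # hypothetical input line; the real files have millions of these
    line = "412839.10,5654823.22,102.45\n"
    x, y, z = map(float, line.split(","))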
Opening and reading all the files obviously takes its time, but what's worse is that I am running into serious memory problems. The bottleneck seems to be inserting a tuple containing the current coordinate, together with a bounding box, into the quadtree. After processing a few files my virtual memory climbs to 10 GB and more.
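For scale, here is a rough back-of-the-envelope check I did (my own sketch; the exact numbers depend on the Python build). Each point tuple alone costs on the order of 100+ bytes, before the quadtree adds its own per-item bounding box and node overhead:

    import sys

    point = (412839.10, 5654823.22, 102.45)  # made-up coordinate
    # size of the tuple header plus the three float objects it references;
    # typically well over 100 bytes on 64-bit CPython
    total = sys.getsizeof(point) + sum(sys.getsizeof(v) for v in point)
    print(total)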
What I have tried so far: writing the matching points to a new .txt file so I would not have to keep everything in memory, but that was not any faster. I also tried a sqlite database, but that didn't do the trick either. The downside of both approaches is that I lose the spatial indexing the quadtree gives me. Is there anything I can do to stick with the quadtree approach and lower the memory consumption? Here is the relevant part of my code:
from pyqtree import Index  # quadtree spatial index

def pointcloud_thin(log, xyz_files, bbox):
    # ... (omitted: bbox_points is derived from the bbox argument)
    x_min, y_min = bbox_points[0]
    x_max, y_max = bbox_points[2]
    # Using a quadtree to store values from the new pointcloud
    # for better performance; bbox is (xmin, ymin, xmax, ymax)
    spindex = Index(bbox=(x_min, y_min, x_max, y_max))
    for xyz_file in xyz_files:
        with open(xyz_file) as f:
            for line in f:
                try:
                    # convert each coordinate to float exactly once
                    x, y, z = map(float, line.split(","))
                except ValueError:
                    continue  # skip malformed lines
                if x_min <= x <= x_max and y_min <= y <= y_max:
                    # each point is inserted with a degenerate bounding
                    # box, i.e. the point itself
                    spindex.insert((x, y, z), (x, y, x, y))
                    #new_file.write(f"{x},{y},{z}\n")  # txt file attempt
                    #pointcloud.save([x, y, z])        # sqlite3 attempt
    return spindex
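For completeness, this is roughly how I call it and the kind of lookup I do afterwards. The file names and coordinates below are made up, and I'm assuming here that bbox is the four-corner list the omitted code turns into bbox_points:

    import logging

    log = logging.getLogger(__name__)

    # hypothetical bounding box: four corner points, counter-clockwise
    bbox = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0), (0.0, 1000.0)]
    spindex = pointcloud_thin(log, ["tile_001.xyz", "tile_002.xyz"], bbox)

    # later, more specific lookups, e.g. all points in a 10x10 window
    matches = spindex.intersect((100.0, 100.0, 110.0, 110.0))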