I am trying to parse multiple large .txt files (actually .xyz files) that contain xyz coordinates. Each file has millions of lines, each representing one coordinate (x, y, z) separated by commas. I only want to keep the coordinates that fall inside a specific bounding box. Once found, I want to keep those coordinates around for multiple more specific lookups, so I wanted to store them spatially indexed in a quadtree.
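For context, each line looks roughly like this (the values are made up), and I parse it into three floats:

    # hypothetical input line; the real files have millions of these
    line = "412839.10,5654823.22,102.45\n"
    x, y, z = map(float, line.split(","))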
Opening and reading all the files obviously takes its time, but what's worse is that I am running into serious memory problems. The bottleneck seems to be inserting a tuple containing the current coordinate, together with a bounding box, into the quadtree. After processing a few files my virtual memory climbs to 10 GB and more.
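For scale, here is a rough back-of-the-envelope check I did (my own sketch; the exact numbers depend on the Python build). Each point tuple alone costs on the order of 100+ bytes, before the quadtree adds its own per-item bounding box and node overhead:

    import sys

    point = (412839.10, 5654823.22, 102.45)  # made-up coordinate
    # size of the tuple header plus the three float objects it references;
    # typically well over 100 bytes on 64-bit CPython
    total = sys.getsizeof(point) + sum(sys.getsizeof(v) for v in point)
    print(total)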
What I have tried so far: writing the matching points to a new .txt file so I would not have to keep everything in memory, but that was not any faster. I also tried a sqlite database, but that didn't do the trick either. The downside of both approaches is that I lose the spatial indexing the quadtree gives me. Is there anything I can do to stick with the quadtree approach and lower the memory consumption? Here is the relevant part of my code:
from pyqtree import Index  # quadtree spatial index

def pointcloud_thin(log, xyz_files, bbox):
    # ... (omitted: bbox_points is derived from the bbox argument)
    x_min, y_min = bbox_points[0]
    x_max, y_max = bbox_points[2]
    # Using a quadtree to store values from the new pointcloud
    # for better performance; bbox is (xmin, ymin, xmax, ymax)
    spindex = Index(bbox=(x_min, y_min, x_max, y_max))
    for xyz_file in xyz_files:
        with open(xyz_file) as f:
            for line in f:
                try:
                    # convert each coordinate to float exactly once
                    x, y, z = map(float, line.split(","))
                except ValueError:
                    continue  # skip malformed lines
                if x_min <= x <= x_max and y_min <= y <= y_max:
                    # each point is inserted with a degenerate bounding
                    # box, i.e. the point itself
                    spindex.insert((x, y, z), (x, y, x, y))
                    #new_file.write(f"{x},{y},{z}\n")  # txt file attempt
                    #pointcloud.save([x, y, z])        # sqlite3 attempt
    return spindex
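For completeness, this is roughly how I call it and the kind of lookup I do afterwards. The file names and coordinates below are made up, and I'm assuming here that bbox is the four-corner list the omitted code turns into bbox_points:

    import logging

    log = logging.getLogger(__name__)

    # hypothetical bounding box: four corner points, counter-clockwise
    bbox = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0), (0.0, 1000.0)]
    spindex = pointcloud_thin(log, ["tile_001.xyz", "tile_002.xyz"], bbox)

    # later, more specific lookups, e.g. all points in a 10x10 window
    matches = spindex.intersect((100.0, 100.0, 110.0, 110.0))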