
I have a tree. It has a flat bottom. We're only interested in the bottom-most leaves, but this is roughly how many leaves there are at the bottom...

2 x 1600 x 1600 x 10 x 4 x 1600 x 10 x 4

That's about 13,107,200,000,000 leaves. Because of that size (the calculation performed on each leaf seems unlikely ever to be optimised to take less than one second), I've given up on the idea that it will be possible to visit every leaf.

So I'm thinking I'll build a 'smart' leaf crawler which inspects the most "likely" nodes first (based on results from the ones around it). So it's reasonable to expect the leaves to be evaluated in branches/groups of neighbours, but the groups will vary in size and distribution.

What's the smartest way to record which leaves have been visited and which have not?

John Mee

3 Answers


It seems that you're looking for a quick and memory-efficient way to do a membership test. If so, and if you can cope with some false positives, go for a Bloom filter.

Bottom line: use Bloom filters in situations where your data set is really big AND all you need is to check whether a particular element exists in the set AND a small chance of false positives is tolerable.

Bloom filter implementations for Python exist.
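
If you'd rather avoid a dependency, a minimal pure-Python sketch (hashlib only) might look like this; the bit-array size and hash count below are arbitrary example parameters, not values tuned for your data:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: fixed memory, no false negatives,
        false-positive rate tunable via bit count and hash count."""

        def __init__(self, size_bits=10_000_000, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            # Derive num_hashes bit positions from one SHA-256 digest.
            digest = hashlib.sha256(repr(item).encode()).digest()
            for i in range(self.num_hashes):
                chunk = digest[4 * i:4 * i + 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Mark a leaf (identified by its coordinate tuple) as visited.
    visited = BloomFilter()
    visited.add((0, 1599, 42, 3, 0, 800, 5, 2))
    print((0, 1599, 42, 3, 0, 800, 5, 2) in visited)  # True
    print((1, 0, 0, 0, 0, 0, 0, 0) in visited)        # almost certainly False

Note that a Bloom filter only has to be sized for the leaves you actually expect to visit, not for the whole 13-trillion-leaf space.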

Hope this helps.

sitifensys

You don't give a lot of information, but I would suggest tuning your search algorithm to help you keep track of what it's seen. If you had a global way of ranking leaves by "likelihood", you wouldn't have a problem since you could just visit leaves in descending order of likelihood. But if I understand you correctly, you're just doing a sort of hill climbing, right? You can reduce storage requirements by searching complete subtrees (e.g., all 1600 x 10 x 4 leaves in a cluster that was chosen as "likely"), and keeping track of clusters rather than individual leaves.
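
For what it's worth, cluster-level bookkeeping could be as simple as the sketch below (the coordinate layout is just an assumption read off the 2 x 1600 x 1600 x 10 x 4 x 1600 x 10 x 4 factorisation in your question):

    # Track whole 1600 x 10 x 4 clusters instead of individual leaves.
    # A cluster is identified by the five indices above it, so there are
    # only 2 * 1600 * 1600 * 10 * 4 = ~205 million possible cluster ids
    # rather than ~13.1 trillion leaf ids.
    visited_clusters = set()

    def mark_cluster_done(cluster_id):
        visited_clusters.add(cluster_id)

    def cluster_done(cluster_id):
        return cluster_id in visited_clusters

    mark_cluster_done((1, 250, 1101, 7, 3))
    print(cluster_done((1, 250, 1101, 7, 3)))  # True
    print(cluster_done((0, 0, 0, 0, 0)))       # False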

It sounds like your tree geometry is consistent, so depending on how your search works, it should be easy to merge your nodes upwards... e.g., keep track of level 1 nodes whose leaves have all been examined, and when all children of a level 2 node are in your list, drop the children and keep their parent. This might also be a good way to choose what to examine: If three children of a level 3 node have been examined, the fourth and last one is probably worth examining too.
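
A rough sketch of that upward merge, assuming nodes are addressed by index tuples and the fan-out at each level comes from the fixed geometry in your question (both assumptions on my part):

    # Collapse fully-examined children into their parent so the "done" set
    # stays small. Nodes are index tuples; fan-outs are the fixed tree
    # geometry from the question.
    done = set()

    def mark_done(node, fanouts):
        done.add(node)
        if not node:
            return                       # reached the root
        parent = node[:-1]
        siblings = [parent + (i,) for i in range(fanouts[len(parent)])]
        if all(s in done for s in siblings):
            done.difference_update(siblings)
            mark_done(parent, fanouts)   # recurse upwards

    fanouts = [2, 1600, 1600, 10, 4, 1600, 10, 4]
    for i in range(4):
        mark_done((0, 5, 9, 2, i), fanouts)
    print((0, 5, 9, 2) in done)     # True: the four children merged upwards
    print((0, 5, 9, 2, 0) in done)  # False: individual children were dropped

The same structure doubles as the "what to examine next" heuristic: a parent with three of its four children already in the done set points straight at the missing sibling.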

Finally, a thought: Are you really, really sure that there's no way to exclude some solutions in groups (without examining every individual one)? Problems like sudoku have an astronomically large search space, but a good brute-force solver eliminates large blocks of possibilities without examining every possible 9 x 9 board. Given the scale of your problem, this would be the most practical way to attack it.

alexis

Maybe this is too obvious, but you could store your results in a similar tree. Since your computation is slow, the results tree should not grow out of hand too quickly. Then just look up if you have results for a given node.
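
A minimal sketch of such a results tree, using nested dicts created lazily (the tuple addressing of nodes is my assumption, since the question doesn't give one):

    # Sparse results tree: nested dicts, built lazily, so memory grows
    # only with the leaves actually computed.
    results = {}

    def store(path, value):
        node = results
        for index in path[:-1]:
            node = node.setdefault(index, {})
        node[path[-1]] = value

    def lookup(path):
        node = results
        for index in path:
            if not isinstance(node, dict) or index not in node:
                return None          # not computed yet
            node = node[index]
        return node

    store((0, 17, 1204, 3, 1, 88, 9, 2), 42.0)
    print(lookup((0, 17, 1204, 3, 1, 88, 9, 2)))   # 42.0
    print(lookup((1, 0, 0, 0, 0, 0, 0, 0)))        # None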

Janne Karila
  • By my calculations, even if I did devise a way to keep it down to one bit per leaf I'd need 1.5 TeraBytes of storage. – John Mee Sep 01 '12 at 09:54
  • @JohnMee How many leaves are you going to process? My idea was that the results tree would only contain results and not allocate any space for missing results. – Janne Karila Sep 01 '12 at 10:15