0

I have a huge list of tuples :

ijs = [(0,1),(0,2),(0,3), (3,2)...]

For a given value v, I want to get only the (i,j) pairs that have either i=v or j=v (from all possible (i,j) pairs stored in ijs).

E.g. for v=0 and given ijs = [(0,1),(0,2),(0,3), (3,2)], then I should get back only_current = [(0,1),(0,2),(0,3)]


Example:

Please ignore the first 3 lines where I built a list ijs having tuples inside.

import numpy as np

# IGNORE THIS PART UNTIL THE MAIN LOOP 
N= 1000
indices_upper_triangle = np.triu_indices(N, k = 1) # k=1 above diagonal
i,j = indices_upper_triangle
ijs = [(i,j) for i,j in zip(i,j)] # create (i,j) positions

# MAIN LOOP HERE 
# Main loop
all_neig_results = list()
for v in range(N): # for each point

    # from all possible (i,j) pairs, get only (i,j) pairs that have either i=v or j=v
    only_current = [item for item in ijs if v in item]

    all_neig_results.append(only_current)

The list comprehension in the loop is super slow.

%timeit [item for item in ijs if v in item]
15.9 s ± 361 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If I remove the check argument if v in item:

%timeit [item for item in ijs]
1.28 s ± 90.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I optimize [item for item in ijs if v in item]?

seralouk
  • 30,938
  • 9
  • 118
  • 133
  • tried using `np.ma.masked_array`? It would be way faster than using python lists. – Tirth Patel Jun 18 '20 at 12:13
  • 1
    thanks for the suggestion but my problem is the list comprehensions in the loop. I will rephrase my question – seralouk Jun 18 '20 at 12:17
  • updated to state clearly my problem – seralouk Jun 18 '20 at 12:34
  • Have you tried a filter with a lambda function? – kubatucka Jun 18 '20 at 12:44
  • Hi. No. Would that help? Can you provide an answer/example? – seralouk Jun 18 '20 at 12:45
  • I know you edited your question, but if you're still using euclidean distance to find neighbors, try using a KDTree https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html – bogovicj Jun 18 '20 at 12:47
  • output= list(filter(lambda x: x[0]==v or x[1] ==v , ijs)) I just tried it: 232 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) – kubatucka Jun 18 '20 at 12:48
  • @kubatucka using filter appears to be even slower so this cannot be used – seralouk Jun 18 '20 at 12:53
  • @bogovicj I will have a look at it thanks – seralouk Jun 18 '20 at 12:53
  • @bogovicj could you point me to an example of the `KDTree` usage cause the documentation is a bit unclear? – seralouk Jun 18 '20 at 13:12
  • @seralouk, sure. I have in mind using the form t = KDTree( distances ); t.query( pt, k=1); # to get the single nearest neighbor of point pt. This example may help but uses the query_ball_tree. which finds other points within a distance threshold. https://stackoverflow.com/questions/27523982/how-to-find-set-of-points-in-x-y-grid-using-kdtree-query-ball-tree (You'll have to repeat that query for every point, well, half of the points, since if y is x's nearest neighbor, you'll know that x is y's nearest neighbor) – bogovicj Jun 18 '20 at 13:19
  • i see. I will have a look at it thanks. I think the first command should be `t = KDTree( pts)` with `points` being all my points and not `t = KDTree( distances )`, right? In this case, `scipy's pdist` is not needed at all if i understand corrently. Also, i need to find the neighbours of a point i within a radius.. If the distance is less than a value, then these are the neighbors of point i. – seralouk Jun 18 '20 at 13:43
  • oops you're right about pts vs distances, my bad ;). especially silly of me since the whole point is to avoid needing all pairwise dists – bogovicj Jun 18 '20 at 14:13
  • Thanks ! insane nice model this KDtree. If you write an answer I accept it immediately – seralouk Jun 18 '20 at 14:36
  • Please don't accept answers answering something else than the question asked! It's useless to anyone seeing this in the future. (I know you changed the question, it's also better to avoid this) – One Lyner Jun 18 '20 at 15:18

2 Answers2

2

You are doing O(N^2) work here, because you go over the whole list each time in the inner loop.

You should precompute once:

import collections
def precompute(pairs):
    d = collections.defaultdict(list)
    for ij in pairs:
        d[ij[0]].append(ij)
        d[ij[1]].append(ij)
    return d

d = precompute(ijs)

to make the work in the inner loop fast:

for v in range(N):
    only_current = d[v]

this way you are doing only an O(N) work.

One Lyner
  • 1,964
  • 1
  • 6
  • 8
1

If your goal is to find nearest neighbors, I'd recommend using a KDTree, which

can be used to rapidly look up the nearest neighbors of any point.

and avoids having to loop over all pairs of points / indexes.

See this example: How to find set of points in x,y grid using KDTree.query_ball_tree

Concerning the leafsize parameter: KDTrees "split up" space into different regions. Each leaf corresponds to a region. if leafsize =1 then it continues to make splits until each region contains only one point. This means building the tree will take a long time, but searches will be fast. If the leaves contain more points, the tree is shallower so building it will be faster. But queries will probably be slower. the "brute-force" bit in the doc refers to the fact that if you have more than 1 point in each region, you need to brute force to figure out nearest neighbors among those in the region.

leafsize=10 will mean the leaf has 10 or less. Each split is such that child nodes have (approximately) half of the points in the parent node, so the precise number of points in each leaf will be variable. there are other implementation details that could affect this.

bogovicj
  • 363
  • 2
  • 9
  • do you happen to know that does the input parameter `leafsize` mean? The documentation says `The number of points at which the algorithm switches over to brute-force. ` but what does this mean ? – seralouk Jun 18 '20 at 19:38
  • KDTrees "split up" space into different regions. Each leaf corresponds to a region. if `leafsize =1` then it continues to make splits until each region contains only one point. This means building the tree will take a long time, but searches will be fast. If the leaves contain more points, the tree is shallower so building it will be faster. But queries will probably be slower. the "brute-force" bit in the doc refers to the fact that if you have more than 1 point in each region, you need to brute force to figure out nearest neighbors among those in the region. – bogovicj Jun 18 '20 at 19:46
  • 1
    Great explanation thanks -- I added it to your answer for completeness. So `leafsize=10` (the default) means that each leaf with have 10? or at least (or most) 10? – seralouk Jun 18 '20 at 20:18
  • Good call on adding to the answer :). added a comment on this point to it just now – bogovicj Jun 18 '20 at 21:12