
Suppose there are two arrays, I and J, which define the neighbor pairs:

I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])

This means element 0 has two neighbors, 1 and 2; element 1 has only 0 as a neighbor; and so on.

What is the most efficient way to create arrays of all neighbor triples I', J', K' such that j is a neighbor of i and k is a neighbor of j, given that i, j, and k are distinct elements (i != j != k)?

Ip = np.array([0, 1, 2, 3])
Jp = np.array([2, 0, 0, 2])
Kp = np.array([3, 2, 1, 0])

Of course, one way is to loop over each element. Is there a more efficient algorithm? (I'm working with 10-500 million elements.)

iacob
Roy
    Why are you using arrays? This is a graph problem: use `networkx` or other graph package. Let the package worry about your efficiency. – Prune Mar 02 '21 at 07:32
    Please clearly specify your use case. Do you need to generate all possible triples? Is this an on-demand service? Is it worth memoization? – Prune Mar 02 '21 at 07:33
  • @Prune Thanks, I'll take a look at the networkx package. But I'm not sure what you mean by all possible triples. Is there a way to get them part by part? – Roy Mar 02 '21 at 07:37
    You say that you want to create triples in which the middle node is connected to each of the end nodes. However, you don't specify anything about how you decide which triples to produce. Knowing the program specs is important to choosing a design and implementation. – Prune Mar 02 '21 at 07:40
  • Yes, I meant all the possible triples. The example also lists all the possible ones. – Roy Mar 02 '21 at 07:44
  • Your comments on my answer now narrow the problem down well enough, but you're asking for libraries or other packaged resources to handle it, which is specifically off-topic for Stack Overflow. However, if you *choose* a package for your data structures and post your implementation, we may well be able to work on your algorithm. – Prune Mar 02 '21 at 19:06
  • @Roy, if getting the number of such triplets per vertex instead of all the triplets suffices, there is an elegant and fast solution. Compute the adjacency matrix of your graph (as done in my solution below) and square it. The i-jth entry will contain the number of length 2 paths from vertex i to vertex j. – Shir Mar 05 '21 at 12:04
    adjacency matrix for a million rows... – anon01 Mar 07 '21 at 02:35
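Shir's squaring suggestion can be combined with a sparse matrix to sidestep the dense N^2 storage anon01 objects to. A sketch (not from the thread; the use of scipy here is an added assumption):

```python
import numpy as np
from scipy.sparse import csr_matrix

I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])

# Build a sparse adjacency matrix from the edge lists.
n = int(max(I.max(), J.max())) + 1
A = csr_matrix((np.ones_like(I), (I, J)), shape=(n, n))

# Squaring counts length-2 paths: P2[i, k] = number of paths i -> j -> k.
P2 = (A @ A).toarray()

# Diagonal entries count i -> j -> i round trips; zero them to enforce i != k.
np.fill_diagonal(P2, 0)
print(P2)
```

On the question's example this yields exactly one i -> j -> k path for each of the four triples, matching the counts without ever materializing them.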

4 Answers


I would go with a very simple approach and use pandas (I and J are your NumPy arrays):

import pandas as pd

df1 = pd.DataFrame({'I': I, 'J': J})
# Rename so that the shared column 'I' holds the middle node of each triple
df2 = df1.rename(columns={'I': 'K', 'J': 'I'})

# Join on the middle node, then drop triples whose endpoints coincide (i == k)
result = pd.merge(df2, df1, on='I').query('K != J')

The advantage is that pandas.merge relies on a very fast underlying numerical implementation. You can also make the computation faster, for example by merging on indexes.

To reduce the memory this approach needs, it is probably worth shrinking df1 and df2 before merging them (for example, by changing the dtype of their columns to the smallest integer type that fits your node IDs).
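The index-based merge mentioned above could look like this (a sketch; the index name `mid` for the shared middle node is illustrative):

```python
import numpy as np
import pandas as pd

I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])

# Put the middle node of each prospective triple into the index on both
# sides, so the join aligns on it directly.
left = pd.DataFrame({'K': I}, index=pd.Index(J, name='mid'))   # edges reversed
right = pd.DataFrame({'J': J}, index=pd.Index(I, name='mid'))  # edges as given

# Columns of the result map onto the triple (i, j, k) as K = i, mid = j, J = k
result = left.join(right).query('K != J').reset_index()
```

With non-unique indexes the join produces the same cartesian expansion per key as the column merge, so the result contains the same triples.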

Here is an example of how to optimize speed and memory of the computation:

from timeit import timeit
import numpy as np
import pandas as pd

I = np.random.randint(0, 10000, 1000000)
J = np.random.randint(0, 10000, 1000000)

df1_64 = pd.DataFrame({'I': I, 'J': J})
df1_32 = df1_64.astype('int32')
df2_64 = df1_64.rename(columns={'I': 'K', 'J': 'I'})
df2_32 = df1_32.rename(columns={'I': 'K', 'J': 'I'})

timeit(lambda: pd.merge(df2_64, df1_64, on='I').query('K != J'), number=1)
# 18.84
timeit(lambda: pd.merge(df2_32, df1_32, on='I').query('K != J'), number=1)
# 9.28
Riccardo Bucco
    This is exactly what I was looking for. Very smart solution! Thanks for sharing it. – Roy Mar 12 '21 at 11:30

There is no particularly magical algorithm for generating all of the triples. You can avoid re-fetching a node's neighbors with an orderly search, but that's about it.

  • Make an empty list, N, of nodes to check.
  • Add some start node, S, to N
  • While N is not empty
    • Pop a node off the list; call it A.
    • Make a set of its neighbors, A'.
    • for each neighbor B of A
      • for each element a of A'
        • Generate the triple (a, A, B)
      • Add B to the list of nodes to check, if it has not already been checked.

Does that help? There are still several details to handle in the algorithm above, such as avoiding duplicate generation, and fine points of moving through cliques.
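A minimal sketch of the steps above (the `adj` dict, the queue, and all names are illustrative, not part of the answer; it handles the ABA detail by skipping a == B):

```python
from collections import deque

def centred_triples(adj, start):
    """Emit (a, A, B) triples where a and B are distinct neighbours of A,
    visiting each node reachable from `start` exactly once."""
    seen = {start}
    queue = deque([start])
    out = []
    while queue:
        A = queue.popleft()
        neighbours = adj[A]
        for B in neighbours:
            for a in neighbours:
                if a != B:             # avoid degenerate (B, A, B) triples
                    out.append((a, A, B))
            if B not in seen:          # schedule B only once
                seen.add(B)
                queue.append(B)
    return out

adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}  # the question's example graph
print(centred_triples(adj, 0))
```

On the example graph this produces the four triples from the question, each centred on its middle node.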

Prune
  • Thanks, however I am trying to avoid lists and explicit loops here. Something based on itertools maybe, or tricks like bincount. – Roy Mar 02 '21 at 08:06
    Itertools merely hides the loop from you, and replaces the explicit loop with a generator. Any particular solution will depend on your data structures. If you make the expected coding attempt and get stuck, you'll have a good question to post. This one is headed off the rails for Stack Overflow's charter. – Prune Mar 02 '21 at 08:09
  • That said, your triple generation should be straightforward: a nested list comprehension with a check that you're not making (A, B, A) triples. – Prune Mar 02 '21 at 08:10
  • I edited the question to clarify ABA. The whole point is efficiency, and moving loops from Python into numerical libraries, or using arrays instead of lists, is usually the best approach for speeding things up. – Roy Mar 02 '21 at 08:17

This is an initial solution to your problem using networkx, an optimized library for graph computations:

import numpy as np
import networkx as nx

I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])

I_, J_, K_ = [], [], []
num_nodes = np.max(np.concatenate([I, J])) + 1
A = np.zeros((num_nodes, num_nodes))
A[I, J] = 1
print("Adjacency Matrix:")
print(A)
G = nx.from_numpy_matrix(A)  # in networkx >= 3.0, use nx.from_numpy_array

for i in range(num_nodes):
    first_neighbors = list(G.neighbors(i))

    for j in first_neighbors:
        second_neighbor = list(G.neighbors(j))
        second_neighbor_no_circle = list(filter(lambda node: node != i, second_neighbor))
        num_second_neighbors = len(second_neighbor_no_circle)

        if num_second_neighbors > 0:
            I_.extend(num_second_neighbors * [i])
            J_.extend(num_second_neighbors * [j])
            K_.extend(second_neighbor_no_circle)
            
I_, J_, K_ = np.array(I_), np.array(J_), np.array(K_)
print("result:")
print(I_)
print(J_)
print(K_)

####### Output ####### 
Adjacency Matrix:
[[0. 1. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 1.]
 [0. 0. 1. 0.]]
result:
[0 1 2 3]
[2 0 0 2]
[3 2 1 0]

I used %%timeit on the code above without print statements to check the running time: 49 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Complexity analysis: Finding all the neighbors of all the neighbors is essentially taking 2 steps in a Depth First Search algorithm. This could take, depending on the graph's topology, up to O(|V| + |E|) where |E| is the number of edges in the graph and |V| is the number of vertices.

To the best of my knowledge, there is no better algorithm on a general graph. However, if you do know some special properties about the graph, the running time could be more tightly bounded or perhaps alter the current algorithm based on this knowledge.

For instance, if you know all the vertices have at most d edges, the work per starting vertex is bounded by O(d^2) (at most d neighbors, each with at most d second neighbors), giving O(|V| * d^2) overall, which is considerably better when d^2 << |E|.

Let me know if you have any questions.

iacob
Shir
  • Thanks, but it seems this won't work for a large number of nodes. The matrix size increases as N^2. – Roy Mar 05 '21 at 20:56
  • Too bad... Take a look at [this thread](https://datascience.stackexchange.com/questions/13455/large-graphs-networkx-distributed-alternative), they discuss some solutions for handling large graphs. – Shir Mar 06 '21 at 06:12

What you are looking for is all paths on 3 nodes in the graph. You can generate them with the following recursive algorithm:

import numpy as np
import networkx as nx

def findPaths(G,u,n):
    """Returns a list of all paths of length `n` starting at vertex `u`."""
    if n==1:
        return [[u]]
    paths = [[u]+path for neighbor in G.neighbors(u) for path in findPaths(G,neighbor,n-1) if u not in path]
    return paths

# Generating graph
vertices = np.unique(I)
edges = list(zip(I,J))
G = nx.Graph()
G.add_edges_from(edges)

# Grabbing all 3-paths
paths = [path for v in vertices for path in findPaths(G,v,3)]
paths
>>> [[0, 2, 3], [1, 0, 2], [2, 0, 1], [3, 2, 0]]
iacob