I am trying to iterate over two lists of the same length, and for the pair of entries per index, execute a function. The function aims to cluster the entries according to some requirement X on the value the function returns.

The lists in question are:

e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]      
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]  

and the function takes four arguments: the e_list and p_list entries for each pair of indices. The function is defined as

def deltaR(e1, p1, e2, p2):
    de = e1 - e2                                                                                                                                                                     
    dp = p1 - p2                                                                                                                                                           
    return de*de + dp*dp

I have so far been able to loop over the lists simultaneously as:

for index, (eta, phi) in enumerate(zip(e_list, p_list)):
    for index2, (eta2, phi2) in enumerate(zip(e_list, p_list)):
        if index == index2:
            continue  # skip pairing an entry with itself
        if deltaR(eta, phi, eta2, phi2) < X:
            print (index, index2), deltaR(eta, phi, eta2, phi2)

This loop executes the function on every combination except those with identical indices, i.e. 0,0 or 1,1 etc.

The output of the code returns:

(0, 1) 0.659449892453
(1, 0) 0.659449892453
(2, 3) 0.657024790285
(2, 4) 0.642297230697
(3, 2) 0.657024790285
(3, 4) 0.109675332432
(4, 2) 0.642297230697
(4, 3) 0.109675332432

I am trying to return the number of indices that are all matched following the condition above. In other words, to rearrange the output to:

output = [No. matched entries]

i.e.

output = [2, 3]

2 coming from the fact that indices 0 and 1 are matched

3 coming from the fact that indices 2, 3, and 4 are all matched

A possible way I have thought of is to append to a list, all the indices used such that I return

output_list = [0, 1, 1, 0, 2, 3, 4, 3, 2, 4, 4, 2, 3]

Then, I use defaultdict to count the occurrences:

from collections import defaultdict
hits = defaultdict(int)
for index in output_list:
    hits[index] += 1

From the dict I can manipulate it to return [2,3] but is there a more pythonic way of achieving this?

domsmiff
  • In *this* example, the condition partitions the list of indices, but I don't see why that should always be the case. Distance-based measures of similarity are usually not transitive (if x is close to y and y is close to z then it need not be the case that x is close to z). But, if it isn't transitive, it really isn't clear what your output is supposed to be, since you seem to want the sizes of the cells in a partition. – John Coleman Oct 03 '16 at 11:24
  • In this case `deltaR(0,1)` and `deltaR(1,0)` yield the same value so it should be considered once. Therefore, I am not seeking the reversed pair, only the number of indices that satisfy the above requirement i.e. 2 – domsmiff Oct 03 '16 at 11:31

1 Answer

This is finding connected components of a graph, which is very easy and well documented, once you revisit the problem from that view.

The data being in two lists is a distraction. I am going to consider the data to be zip(e_list, p_list). Consider this as a graph, which in this case has 5 nodes (but could have many more on a different data set). Construct the graph using these nodes, and connect two of them with an edge if they pass your distance test.
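As a rough sketch of that construction, the snippet below builds the graph as a plain adjacency list, where entry i holds the indices of every point that passes the distance test against point i. The helper name build_adjacency and the threshold X = 1 are assumptions for illustration; the question never states the actual value of X.

```python
e_list = [-0.619489, -0.465505, 0.124281, -0.498212, -0.51]
p_list = [-1.7836, -1.14238, 1.73884, 1.94904, 1.84]
X = 1  # assumed threshold, not given in the question


def deltaR(e1, p1, e2, p2):
    de = e1 - e2
    dp = p1 - p2
    return de * de + dp * dp


def build_adjacency(points, threshold):
    """Hypothetical helper: adjacency[i] lists the indices close to point i."""
    adjacency = [[] for _ in points]
    for i, (eta1, phi1) in enumerate(points):
        # Start at i + 1 so each unordered pair is tested only once.
        for j in range(i + 1, len(points)):
            eta2, phi2 = points[j]
            if deltaR(eta1, phi1, eta2, phi2) < threshold:
                # The graph is undirected, so record the edge on both ends.
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


points = list(zip(e_list, p_list))
adjacency = build_adjacency(points, X)
print(adjacency)  # [[1], [0], [3, 4], [2, 4], [2, 3]]
```

Testing each unordered pair once also removes the mirrored (0, 1) / (1, 0) results seen in the original double loop.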

From there, you only need to determine the connected components of an undirected graph, which is covered in many places. Here is a basic depth first search on this site: Find connected components in a graph

You loop through the nodes once, performing a DFS to find all connected nodes. Once you look at a node, mark it visited, so it does not get counted again. To get the answer in the format you want, simply count the number of unvisited nodes found from each unvisited starting point, and append that to a list.
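The loop just described might look like the sketch below. It assumes the graph is already stored as an adjacency list (a list of neighbour-index lists); the function name component_sizes and the hard-coded sample adjacency are illustrative, not part of the original code. It uses an explicit stack rather than recursion, which is a design choice made here, not something the approach requires.

```python
def component_sizes(adjacency):
    """Return the size of each connected component, in order of discovery."""
    visited = [False] * len(adjacency)
    sizes = []
    for start in range(len(adjacency)):
        if visited[start]:
            continue  # already counted as part of an earlier component
        visited[start] = True
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if not visited[neighbour]:
                    # Mark on push so no node enters the stack twice.
                    visited[neighbour] = True
                    stack.append(neighbour)
        sizes.append(size)
    return sizes


# Sample adjacency list assumed for the five points in the question.
adjacency = [[1], [0], [3, 4], [2, 4], [2, 3]]
print(component_sizes(adjacency))  # [2, 3]
```

Because the search follows edges rather than comparing every pair, a chain such as 0-1-2 is counted as one component of size 3 even when points 0 and 2 are not directly within the threshold of each other. The explicit stack also sidesteps Python's recursion limit on large components.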

------------------------ graph theory ----------------------

You have data points that you want to break down into related groups. This is a topic in both mathematics and computer science known as graph theory. See: https://en.wikipedia.org/wiki/Graph_theory

You have data points. Imagine plotting them in eta-phi space as rectangular coordinates, then drawing lines between the points that are close to each other. You now have a "graph" with vertices and edges.

Working out which dots are joined together by lines, directly or through other dots, is the problem of finding connected components. It is easy to see by eye, but if you have thousands of points and you want a computer to find the connected components quickly, you use graph theory.

Suppose I make a list of all the eta phi points with zip(e_list, p_list), and each entry in the list is a vertex. If you store the graph in "adjacency list" format, then each vertex will also have a list of the outgoing edges which connect it to another vertex.

Finding a connected component is literally as easy as looking at each vertex, putting a checkmark by it, and then following every line to the next vertex and putting a checkmark there, until you can't find anything else connected. Now find the next vertex without a checkmark, and repeat for the next connected component.

As a programmer, you know that writing your own data structures for common problems is a bad idea when you can use published and reviewed code to handle the task. Search for "python graph module". One example mentioned in the comments is networkx ("pip install networkx"). If you build the graph in networkx, you can iterate over the connected components (each one is a set of nodes), then take the len of each to get the format you want: [len(c) for c in nx.connected_components(G)]

---------------- code -------------------

If the math is unfamiliar, a graph module or a plain Python implementation may look opaque at first, but it's pretty easy if you just look at some of those links. It is basically dots and lines, yet pretty useful when you apply the concepts, as you can see with your problem being nothing but a very simple graph theory problem in disguise.

My graph is a basic list here, so the vertices don't actually have names. They are identified by their list index.

e_list = [-0.619489,-0.465505, 0.124281, -0.498212, -0.51]      
p_list = [-1.7836,-1.14238, 1.73884, 1.94904, 1.84]  

def deltaR(e1, p1, e2, p2):
    de = e1 - e2                                                                                                                                                                     
    dp = p1 - p2                                                                                                                                                           
    return de*de + dp*dp

X = 1 # you never actually said, but this works

def these_two_particles_are_going_the_same_direction(p1, p2):
    return deltaR(p1.eta, p1.phi, p2.eta, p2.phi) < X

class Vertex(object):
    def __init__(self, eta, phi):
        self.eta = eta
        self.phi = phi
        self.connected = []
        self.visited = False

class Graph(object):
    def __init__(self, e_list, p_list):
        self.vertices = []
        for eta, phi in zip(e_list, p_list):
            self.add_node(eta, phi)

    def add_node(self, eta, phi):
        # add this data point at the next available index
        n = len(self.vertices)
        a = Vertex(eta, phi)

        for i, b in enumerate(self.vertices):
            if these_two_particles_are_going_the_same_direction(a,b):
                b.connected.append(n)
                a.connected.append(i)

        self.vertices.append(a)

    def reset_visited(self):
        for v in self.vertices:
            v.visited = False

    def DFS(self, n):
        #perform depth first search from node n, return count of connected vertices
        count = 0
        v = self.vertices[n]
        if not v.visited:
            v.visited = True
            count += 1
            for i in v.connected:
                count += self.DFS(i)
        return count

    def connected_components(self):
        self.reset_visited()
        components = []
        for i, v in enumerate(self.vertices):
            if not v.visited:
                components.append(self.DFS(i))                
        return components

g = Graph(e_list, p_list)
print(g.connected_components())  # [2, 3] for the data above with X = 1
Kenny Ostrom
  • or use [networkx](https://networkx.github.io/) and do not reinvent a wheel. – Łukasz Rogalski Oct 03 '16 at 14:14
  • I'll add the networkx call for this problem, although building the graph, while easy, will still be the hardest part, whether you use a networkx graph or a simple list/dict implementation. – Kenny Ostrom Oct 03 '16 at 14:33
  • Thank you for your suggestion, I have followed your advice. However, I am still seeking an explanation that doesn't rely on using a non-built-in module. – domsmiff Oct 03 '16 at 18:56
  • Graph theory isn't a module you download, it's math. But I added an implementation of your problem in basic python 2.7 with no import. It is probably smart to download a module though, if you're not in the middle of taking an algorithms class. – Kenny Ostrom Oct 03 '16 at 21:48
  • You wanted more explanation, which I added. If that wall of text didn't help, maybe you could comment on what I didn't explain, rather than have me ramble on about a broad topic. – Kenny Ostrom Oct 05 '16 at 14:44