2

I'm new to programming, Python and networkx (ouch!) and am trying to merge four graphml files into one and remove the duplicate nodes, following the excellent instructions here.

However, I can't figure out how to keep track of the duplicate nodes when there are FOUR files to compare instead of two. The code I've written below won't work, but hopefully you can see where my thinking goes wrong and help me.

# script to merge one or more graphml files into one single graphml file
import networkx as nx

# First read the graphml files into Python and NetworkX (duplicate variables as necessary)
A = nx.read_graphml("file1.graphml")
B = nx.read_graphml("file2.graphml")
C = nx.read_graphml("file3.graphml")
D = nx.read_graphml("file4.graphml")

# Create a new graph variable containing all the previous graphs
H = nx.union(A,B,C,D, rename=('1-','2-','3-','4-'))

# Check what nodes are in two or more of the original graphml-files
duplicate_nodes_a_b = [n for n in A if n in B]
duplicate_nodes_b_c = [n for n in B if n in C]
duplicate_nodes_c_d = [n for n in C if n in D]

all_duplicate_nodes = # How should I get this?

# remove duplicate nodes
for n in all_duplicate_nodes:
    n1 = '1-' + str(n)
    n2 = '2-' + str(n)
    n3 = '3-' + str(n)
    n4 = '4-' + str(n)
    H.add_edges_from([(n1, nbr) for nbr in H[n2]]) # How can I take care of duplicate_nodes_b_c and duplicate_nodes_c_d?
    H.remove_node(n2)

# write the merged graph into a new merged graphml file
nx.write_graphml(H, "merged_file.graphml", encoding="utf-8", prettyprint=True)
mattiasostmar
  • If you have a working way to merge 2, you could apply it pairwise, then to the merged graphs. As for merging duplicate lists: `duplicates_a_b = set(n for n in A if n in B)` etc., then `all_duplicates = duplicates_a_b | duplicates_b_c | ...`. – drevicko Jul 06 '13 at 09:36

2 Answers

2

First, note that the way you use nx.union is not what you want. You really need to call it with just two graphs. But how to deal with the duplicates gets complicated this way, because you have to consider all possible pairs of graphs to see how a node could be duplicated.
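
To see the scale of the bookkeeping: with four graphs there are six pairs to compare, not just the three consecutive ones checked in the question. A rough sketch of that pairwise approach, assuming the graphs are already loaded as A, B, C and D, could look like this:

import itertools

# Collect every node that appears in at least two of the four graphs,
# checking all six pairs rather than only consecutive ones.
graphs = [A, B, C, D]
all_duplicate_nodes = set()
for G1, G2 in itertools.combinations(graphs, 2):
    all_duplicate_nodes |= set(G1) & set(G2)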

Instead, let's be more direct and just count up in how many graphs each node appears. This is easy using a Counter:

import collections
ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)  # iterating a graph yields its nodes, so this counts node occurrences across graphs

Now determine which nodes just appear once, using the counter:

singles = {x for (x, n) in ctr.items() if n == 1}

With that set of nodes, we can then compute the subgraphs containing only nodes that are not duplicated:

E = nx.union(A.subgraph(singles), B.subgraph(singles))
F = nx.union(C.subgraph(singles), D.subgraph(singles))
H = nx.union(E, F)

The graph H has all four initial graphs merged with duplicates removed.
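
The same pattern generalizes to any number of graphs. A sketch, assuming the graphs are kept in a list and reusing the singles set computed above:

import functools
import networkx as nx

graphs = [A, B, C, D]  # any number of graphs
# Each subgraph keeps only the non-duplicated nodes, so the node sets are disjoint
# and can be folded together with pairwise unions.
H = functools.reduce(nx.union, (G.subgraph(singles) for G in graphs))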

The approach I've shown creates several intermediate graphs, so for large input graphs you may run into memory problems. If so, a similar approach works: determine the set of duplicated nodes, delete those nodes from the original graphs, and then take the union without keeping all the intermediates. It looks like:

import collections
import networkx as nx

ctr = collections.Counter()
for G in [A, B, C, D]:
    ctr.update(G)

duplicates = {x for (x, n) in ctr.items() if n > 1}
H = nx.Graph()
for G in [A, B, C, D]:
    G.remove_nodes_from(duplicates) # change graphs in-place
    H = nx.union(H, G)

Both approaches take advantage of the way that NetworkX functions often allow extra nodes to be given and silently ignored.
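
For example, a quick check shows both behaviours:

import networkx as nx

G = nx.Graph([(1, 2), (2, 3)])
# subgraph() simply ignores nodes that are not in G.
print(sorted(G.subgraph([1, 2, 99]).nodes()))   # [1, 2]
# remove_nodes_from() likewise skips nodes that are absent.
G.remove_nodes_from([3, 42])
print(sorted(G.nodes()))                        # [1, 2]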

Michael J. Barber
  • There is something funny with the different approaches. Both end up with the same number of nodes, but there is a difference in the number of edges left in the merged graphml file. – mattiasostmar Jun 20 '13 at 08:53
  • I also tried with four different graphml files made up of tweets from totally different searches on Twitter. The merged file ended up with fewer nodes than any of the files. There must be something wrong. – mattiasostmar Jun 20 '13 at 09:09
  • Do you have duplicated nodes in the graphml files? (what does `nx` do with such situations? Perhaps `nx` is merging the nodes? You could do some tests...). What type of objects are your nodes? are they the same in all your tests? (this could change the way equality is measured and thus which are duplicates) – drevicko Jul 06 '13 at 10:01
1

If the graphml files are simple (no weights, properties, etc.), then it may be easier to work at the text level. For instance,

cat A.graphml B.graphml C.graphml | sort -r | uniq > D.graphml

This will keep unique sets of nodes and edges from three graphml files. You can rearrange <graph>, </graph>, <graphml ...>, </graphml> tags in D.graphml later with a text editor.
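
If you would rather stay in Python, a rough equivalent of the same line-level idea (a sketch only, with hypothetical filenames, keeping first occurrences in order instead of sorting) is:

seen = set()
merged_lines = []
for path in ["A.graphml", "B.graphml", "C.graphml"]:
    with open(path) as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                merged_lines.append(line)

with open("D.graphml", "w") as out:
    out.writelines(merged_lines)
# The <graphml> and <graph> wrapper tags still need tidying up afterwards.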

yya