Filter large graphml files by stripping parts

Question

I have a lot of GraphML files. Within a directory I have subdirectories, each containing about 3,000 files. Opened directly with Python, without visualizing them, their structure looks something like this:

<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key id="d0" for="node" attr.name="block" attr.type="string" />
  <graph edgedefault="undirected">
    <node id="0">
      <data key="d0">0000000000000</data>
    </node>
    <node id="1">
      <data key="d0">0200000000000</data>
    </node>
    <node id="2">
      <data key="d0">1900000000000</data>
    </node>

and so on. I want to filter the nodes by their geographical ID, which is the first two digits of the value inside <data key="d0">. How can I do this? Every file has the same header, and this needs to be done for roughly 300,000 files. This will of course alter the network, and networkx very probably won't display it fully, but this is just a test.

This can of course be done with a for loop (once I know how to strip out the part I'm interested in, which I don't yet), but since these are GraphML files, I'm curious whether I can do it directly in Jupyter.
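For reference, the loop-based approach might look roughly like the sketch below. It uses only the standard library's xml.etree.ElementTree and assumes the files follow the structure shown above (one <graph> element per file, block ID stored under key "d0"). The function names filter_graphml and filter_tree, the "*.graphml" file pattern, and the choice to also drop edges that lose an endpoint are illustrative assumptions, not something the files or networkx require.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "http://graphml.graphdrawing.org/xmlns"
# Keep the original namespace prefixes on output instead of ns0:/ns1:.
ET.register_namespace("", NS)
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")


def filter_graphml(src, dst, prefix, key="d0"):
    """Write a copy of src to dst, keeping only nodes whose `key` value
    starts with `prefix`. Returns the number of nodes kept.
    (Hypothetical helper; assumes a single <graph> element per file.)"""
    tree = ET.parse(str(src))
    graph = tree.getroot().find(f"{{{NS}}}graph")
    kept = set()
    for node in list(graph.findall(f"{{{NS}}}node")):
        data = node.find(f"{{{NS}}}data[@key='{key}']")
        if data is not None and (data.text or "").startswith(prefix):
            kept.add(node.get("id"))
        else:
            graph.remove(node)
    # Assumption: edges pointing at a removed node are dropped as well,
    # so that the filtered file can still be loaded as a graph.
    for edge in list(graph.findall(f"{{{NS}}}edge")):
        if edge.get("source") not in kept or edge.get("target") not in kept:
            graph.remove(edge)
    tree.write(str(dst), encoding="utf-8", xml_declaration=True)
    return len(kept)


def filter_tree(in_root, out_root, prefix):
    """Apply filter_graphml to every *.graphml file under in_root,
    mirroring the subdirectory layout under out_root."""
    for src in Path(in_root).rglob("*.graphml"):
        dst = Path(out_root) / src.relative_to(in_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        filter_graphml(src, dst, prefix)
```

Calling filter_tree("graphs", "graphs_02", "02") would then keep only the nodes whose block ID starts with 02 (both directory names are placeholders). This is plain Python, so it runs the same way in a Jupyter cell as in a script.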

