Extracting information from XML files in Python

Question

I want to extract information from several XML files like this one:

https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b001.xml

I only want to extract the attribute values from this tag:

<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">

that is, "micro_b001" and "waste_separation".

I want to save them in a list.

I have tried this:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []
# parse every .xml file found under path
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

The code above works; for each file it gives me an object such as

<xml.etree.ElementTree.ElementTree at 0x21c893e34c0>,

but the following does not look correct:

for k in myList:
    arg= [e.attrib['stance'] for e in k.findall('.//arggraph')  ]
    print(arg)

The second snippet doesn't give me the values I need.

Have you tried the solutions mentioned [here](https://stackoverflow.com/questions/9797274/find-xml-element-based-on-its-attribute-and-change-its-value)? — Stefano Frazzetto, Oct 21 '20 at 21:22
it is kind of different. here i need information in first tag — Moha, Oct 21 '20 at 21:24

score 0 · Answer 1 · answered Oct 21 '20 at 22:03

One way to handle this, using lxml:

from lxml import etree

# the file name must be passed as a string
tree = etree.parse('myfile.xml')
for graph in tree.xpath('//arggraph'):
    print(graph.xpath('@id')[0])
    print(graph.xpath('@topic_id')[0])

Output:

micro_b001
waste_separation
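
For comparison, here is a minimal sketch with the standard library's xml.etree.ElementTree, which the question uses. It assumes, as the linked file suggests, that arggraph is the root element of each document. In ElementTree the path './/arggraph' only searches below the element it is called on, so the root itself is never matched; that would explain why the question's findall call comes back without the expected values. The SAMPLE string is a hypothetical stand-in built from the tag quoted in the question, not the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the opening tag from the question plus a dummy child.
SAMPLE = (
    '<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">'
    '<edu id="e1">some text</edu>'
    '</arggraph>'
)

root = ET.fromstring(SAMPLE)

# './/arggraph' looks at descendants only, so the root element is not matched.
descendant_matches = root.findall('.//arggraph')

# Reading the attributes straight from the root element works.
info = [root.get('id'), root.get('topic_id')]

print(descendant_matches)
print(info)
```

With a tree returned by ET.parse, the root element is available as tree.getroot().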

Jack Fleeting

score 0 · Answer 2 · answered Oct 26 '20 at 01:37

Another method, using the simplified_scrapy package:

import os
from simplified_scrapy import SimplifiedDoc, utils

path = 'test'
# collect the paths of all .xml files under path
myList = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            myList.append(os.path.join(root, file))

for file in myList:
    xml = utils.getFileContent(file)
    doc = SimplifiedDoc(xml)
    arg = [(e['stance'],e['id'],e['topic_id']) for e in doc.selects('arggraph')]
    print(arg)

Result:

[('pro', 'micro_b001', 'waste_separation')]
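
If installing a third-party package is not an option, the same directory walk can be sketched with the standard library alone. This is only an illustration: collect_arggraph_info is a hypothetical helper name, and it assumes that arggraph is the root element of every file and that only id and topic_id are wanted, as in the question.

```python
import os
import xml.etree.ElementTree as ET


def collect_arggraph_info(path):
    """Return a list of (id, topic_id) tuples, one per matching .xml file under path."""
    results = []
    for root_dir, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.endswith('.xml'):
                continue
            # Assumption: <arggraph> is the root element of each file.
            root = ET.parse(os.path.join(root_dir, name)).getroot()
            if root.tag == 'arggraph':
                results.append((root.get('id'), root.get('topic_id')))
    return results
```

Calling collect_arggraph_info('test') would then walk the same 'test' directory used above and return the pairs as a list.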
