
I want to extract information from a couple of XML files like this one:

https://github.com/peldszus/arg-microtexts/blob/master/corpus/en/micro_b001.xml

I only want to extract the attributes of this tag:

<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">

that is, the values "micro_b001" and "waste_separation".

I want to save them in a list.

I have tried this:

import os
import xml.etree.ElementTree as ET

myList = []
myEdgesList = []
# parse every .xml file found under path
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            with open(os.path.join(root, file), encoding="UTF-8") as content:
                tree = ET.parse(content)
                myList.append(tree)

The code above works; it gives one parsed tree per file, such as:

<xml.etree.ElementTree.ElementTree at 0x21c893e34c0>,

But the following does not look correct:

for k in myList:
    arg= [e.attrib['stance'] for e in k.findall('.//arggraph')  ]
    print(arg)

The second snippet doesn't give me the required values.

Moha

2 Answers


One way to handle this:

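from lxml import etree

# the file name must be passed as a string
tree = etree.parse('myfile.xml')
for graph in tree.xpath('//arggraph'):
    # each attribute XPath returns a list; take its single value
    print(graph.xpath('@id')[0])
    print(graph.xpath('@topic_id')[0])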

Output:

micro_b001
waste_separation
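
The same values can also be read with xml.etree.ElementTree, which the question already uses. The sketch below assumes that arggraph is the root element of each file, as it is in the linked micro_b001.xml; SAMPLE is a trimmed-down stand-in for that file, and its edu child is only a placeholder. It also shows why the search in the question comes back empty: './/arggraph' looks only below the root, so the root itself is never matched.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for micro_b001.xml (assumption: <arggraph> is the root).
SAMPLE = (
    '<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">'
    '<edu id="e1">placeholder text</edu>'
    '</arggraph>'
)

root = ET.fromstring(SAMPLE)

# './/arggraph' searches the descendants of the root, not the root itself.
descendants = root.findall('.//arggraph')

# The attributes sit on the root element, so read them from there.
values = [root.get('id'), root.get('topic_id')]

print(descendants)  # []
print(values)       # ['micro_b001', 'waste_separation']
```

For a tree produced by ET.parse, the root is available through tree.getroot().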
Jack Fleeting

Another method.

import os
from simplified_scrapy import SimplifiedDoc, utils

path = 'test'
# collect the paths of all .xml files under path
myList = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith('.xml'):
            myList.append(os.path.join(root, file))

for file in myList:
    xml = utils.getFileContent(file)
    doc = SimplifiedDoc(xml)
    arg = [(e['stance'],e['id'],e['topic_id']) for e in doc.selects('arggraph')]
    print(arg)

Result:

[('pro', 'micro_b001', 'waste_separation')]
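
Since the question asks for the values in a list, the same directory walk can probably be done with the standard library alone. This is only a sketch: it assumes the root element of every file is arggraph, and collect_arggraph_attrs is a hypothetical helper name, not part of any library.

```python
import os
import xml.etree.ElementTree as ET


def collect_arggraph_attrs(path):
    """Return [id, topic_id] for each .xml file under path whose root is <arggraph>."""
    results = []
    for dirpath, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.endswith('.xml'):
                continue
            # ET.parse accepts a file path directly and honors the XML encoding declaration.
            root = ET.parse(os.path.join(dirpath, name)).getroot()
            if root.tag == 'arggraph':
                results.append([root.get('id'), root.get('topic_id')])
    return results
```

Calling collect_arggraph_attrs('test') on a folder that contains only micro_b001.xml should give [['micro_b001', 'waste_separation']].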
dabingsou