tree = etree.fromstring(bytes(xml, encoding='utf-8')) TypeError: encoding without a string argument

Question

I would like to write a function that extracts the value of the "fmc" attribute and the text inside each "part" element. I would like to use a regex solution.

<?xml version = "1.0" encoding="UTF-8" standalone="yes" ?>
<corpus>
    <ver id="18" etude="EC1_Elec" elec="oui" niveau="1" critere="1.3" type="discours">
        <part code="EC1_Elec_IW04_0">Ça existe sur des gros parcs Hlm mais c'est macro.</part>
    </ver>
    <ver id="30" etude="EC1_Elec" elec="oui" niveau="2" critere="" origine="IW" type="discours" fmc="motives">
        <part code="EC1_Elec_IW01_0">Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.</part>
    </ver>
    <ver id="54" etude="EC1_Elec" elec="oui" niveau="1" critere="" origine="IW" type="discours" fmc="condition">
        <part code="EC1_Elec_IW10_0">Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.</part>
    </ver>
    <ver id="897" etude="EC3_Elec" elec="oui" niveau="4" critere="4.1" origine="TR" type="discours" fmc="obstacle">
        <part code="EC3_Elec_TR2_1">Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,</part>
        <iwer>Çava influencer la demande pour ce type de solution c'est ça ?</iwer>
        <part code="EC3_Elec_TR2_1">Je pense oui</part>
    </ver>
</corpus>

So I have modified this function to suit my data, following the answers:

code

def review_extractor(xml, category='verbatim', do_lower=False):
    """
    Extract review and label
    """
    # use lxml...

    # parse the xml snippet into an object tree
    tree = etree.fromstring(bytes(xml, encoding='utf-8'))
    # find all elements that have "fmc" attribute
    for e in tree.findall(".//*[@fmc]"):
        label = e.xpath("./@fmc")[0]
        for c in e.getchildren("./part"):
            # print value of "fmc" attribute and text of child element
            print(f"{label:15}{c.text}")
            # 
        return label, c.text

So for my example, the function should return this (review before label):

Label      review_text
motivation Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.

condition Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.

obstacle Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,

obstacle Je pense oui

Can you clarify what your question is? Why use only regex for this? — AMC, Feb 13 '20 at 17:58
It is a script for splitting data, and I have to adapt it to my data. I also use this function in other functions that split the data into train, test and val sets. I put the rest of the code above. — kely789456123, Feb 13 '20 at 18:02
@kely789456123 I’m not sure how that relates to my comment, can you elaborate? — AMC, Feb 13 '20 at 18:04
Hi @kely789456123 you should not completely edit your question after people have spent time answering it. If you have another question please ask a new one! — tomjn, Feb 13 '20 at 20:42
Never use regular expressions to process XML. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Michael Kay, Feb 13 '20 at 21:10

tomjn · Answer 1 · 2020-02-13T18:29:30.860

I realise you explicitly asked for a regex solution, but as an alternative here is one using one of Python's built-in XML parsers, specifically xml.etree.ElementTree.

xml_string = """<?xml version = "1.0" encoding="UTF-8" standalone="yes" ?>
<corpus>
    <ver id="18" etude="EC1_Elec" elec="oui" niveau="1" critere="1.3" type="discours">
        <part code="EC1_Elec_IW04_0">Ça existe sur des gros parcs Hlm mais c'est macro.</part>
    </ver>
    <ver id="30" etude="EC1_Elec" elec="oui" niveau="2" critere="" origine="IW" type="discours" fmc="motives">
        <part code="EC1_Elec_IW01_0">Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.</part>
    </ver>
    <ver id="54" etude="EC1_Elec" elec="oui" niveau="1" critere="" origine="IW" type="discours" fmc="condition">
        <part code="EC1_Elec_IW10_0">Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.</part>
    </ver>
    <ver id="897" etude="EC3_Elec" elec="oui" niveau="4" critere="4.1" origine="TR" type="discours" fmc="obstacle">
        <part code="EC3_Elec_TR2_1">Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,</part>
        <iwer>Çava influencer la demande pour ce type de solution c'est ça ?</iwer>
        <part code="EC3_Elec_TR2_1">Je pense oui</part>
    </ver>
</corpus>"""

import xml.etree.ElementTree as ET
tree = ET.fromstring(xml_string)

for i in tree.findall('ver'):
    fmc = i.attrib.get("fmc")
    if fmc is None:
        continue
    for p in i.findall("part"):
        print(fmc, p.text)

The output is

motives Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.
condition Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.
obstacle Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,
obstacle Je pense oui

If you want to use XPath expressions, you can simplify it slightly further:

for i in tree.findall('ver[@fmc]'):
    for p in i.findall('part'):
        print(i.attrib['fmc'], p.text)
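Since the question asks for a function that returns the pairs rather than printing them, here is a minimal sketch of the same loop wrapped in a function. The name `extract_labeled_parts` and the shortened sample document are assumptions made for illustration; they are not from the original post.

```python
import xml.etree.ElementTree as ET


def extract_labeled_parts(xml_string):
    """Return (fmc, text) pairs for every <part> under a <ver> that has an fmc attribute."""
    root = ET.fromstring(xml_string)
    pairs = []
    # 'ver[@fmc]' selects only the direct <ver> children carrying an fmc attribute
    for ver in root.findall("ver[@fmc]"):
        # <iwer> elements are skipped because only <part> children are requested
        for part in ver.findall("part"):
            pairs.append((ver.attrib["fmc"], part.text))
    return pairs


# Shortened, hypothetical sample with the same shape as the corpus above
sample = """<corpus>
    <ver id="18"><part>no label here</part></ver>
    <ver id="897" fmc="obstacle">
        <part>first</part>
        <iwer>skipped</iwer>
        <part>second</part>
    </ver>
</corpus>"""

print(extract_labeled_parts(sample))
```

Collecting everything in a list and returning once at the end avoids the problem in the question's version, where a `return` inside the outer loop exits after the first labelled `<ver>`.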

Judson Cruz · Answer 2 · score 0 · answered Feb 13 '20 at 18:07

Try this one: /fmc(.*?)+=(.*?)+\"(.+?)\"/g

score 0 · Answer 3 · answered Feb 13 '20 at 18:08

You could try the regex:

<([a-zA-Z0-9]+)[^\/]*?fmc=([\'\"])(.*?)\2.*?>[\s\n\r]*<([a-zA-Z0-9]+).*?>(.*?)</\4>

The complete code looks like this:

import re

f = """<corpus>
    <ver id="18" etude="EC1_Elec" elec="oui" niveau="1" critere="1.3" type="discours">
        <part code="EC1_Elec_IW04_0">Ça existe sur des gros parcs Hlm mais c'est macro.</part>
    </ver>
    <ver id="30" etude="EC1_Elec" elec="oui" niveau="2" critere="" origine="IW" type="discours" fmc="motivation">
        <part code="EC1_Elec_IW01_0">Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.</part>
    </ver>
    <ver id="54" etude="EC1_Elec" elec="oui" niveau="1" critere="" origine="IW" type="discours" fmc="condition">
        <part code="EC1_Elec_IW10_0">Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.</part>
    </ver>
    <ver id="897" etude="EC3_Elec" elec="oui" niveau="4" critere="4.1" origine="TR" type="discours">
        <part code="EC3_Elec_TR2_1">Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,</part>
        <iwer>Çava influencer la demande pour ce type de solution c'est ça ?</iwer>
        <part code="EC3_Elec_TR2_1">Je pense oui</part>
    </ver>
</corpus>"""

regex = r'<([a-zA-Z0-9]+)[^\/]*?fmc=([\'\"])(.*?)\2.*?>[\s\n\r]*<([a-zA-Z0-9]+).*?>(.*?)</\4>'

matches = re.findall(regex, f)

for x in matches:
    print(x[2] + " " + x[4])

MonkeyZeus · Answer 4 · score 0 · answered Feb 13 '20 at 18:10

Regex is really the wrong solution for this, but this could work:

fmc="(.*?)".*?<part.*?>(.*?)</part>

https://regex101.com/r/M7LJLU/1

Your desired result will be in \1 and \2.

If you have another solution, feel free to propose it. I will try anything at this point. – kely789456123 Feb 13 '20 at 18:17
@kely789456123 you seem insistent on using regex, so I provided a working solution. If you want the proper solution, look into [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html) and learn XPath. – MonkeyZeus Feb 13 '20 at 18:19
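For anyone trying that pattern from Python, here is a rough sketch of how it might be applied. In Python, `.` does not match a newline by default, so on input where `<part>` sits on the line after `<ver ...>` the pattern most likely needs `re.DOTALL`. The shortened sample below is made up for illustration. It also shows a limitation: only the first `<part>` after each `fmc` attribute is captured.

```python
import re

# Pattern from the answer above; re.DOTALL lets "." cross the line break
# between the <ver ...> tag and its <part> child.
pattern = re.compile(r'fmc="(.*?)".*?<part.*?>(.*?)</part>', re.DOTALL)

# Shortened, hypothetical sample with the same shape as the corpus
sample = """<corpus>
    <ver id="54" fmc="condition">
        <part code="a">one</part>
    </ver>
    <ver id="897" fmc="obstacle">
        <part code="b">two</part>
        <part code="b">three</part>
    </ver>
</corpus>"""

matches = pattern.findall(sample)
print(matches)  # "three" is missing: no fmc="..." precedes the second <part>
```

That missing second `<part>` is one concrete reason the parser-based answers are the safer route for this data.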

Booboo · Answer 5 · 2020-02-13T19:08:39.783

This is a job better suited to an XML parser. I use untangle from the PyPI repository:

import untangle

xml = """<?xml version = "1.0" encoding="UTF-8" standalone="yes" ?>
<corpus>
    <ver id="18" etude="EC1_Elec" elec="oui" niveau="1" critere="1.3" type="discours">
        <part code="EC1_Elec_IW04_0">Ça existe sur des gros parcs Hlm mais c'est macro.</part>
    </ver>
    <ver id="30" etude="EC1_Elec" elec="oui" niveau="2" critere="" origine="IW" type="discours" fmc="motives">
        <part code="EC1_Elec_IW01_0">Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.</part>
    </ver>
    <ver id="54" etude="EC1_Elec" elec="oui" niveau="1" critere="" origine="IW" type="discours" fmc="condition">
        <part code="EC1_Elec_IW10_0">Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.</part>
    </ver>
    <ver id="897" etude="EC3_Elec" elec="oui" niveau="4" critere="4.1" origine="TR" type="discours" fmc="obstacle">
        <part code="EC3_Elec_TR2_1">Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,</part>
        <iwer>Çava influencer la demande pour ce type de solution c'est ça ?</iwer>
        <part code="EC3_Elec_TR2_1">Je pense oui</part>
    </ver>
</corpus>
"""

doc = untangle.parse(xml)
for ver in doc.corpus.ver:
    if ver['fmc'] is None: continue
    print(f"id={ver['id']}, fmc={ver['fmc']}")
    for part in ver.part:
        print(f"   part={part.cdata}")

Prints:

id=30, fmc=motives
   part=Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.
id=54, fmc=condition
   part=Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.
id=897, fmc=obstacle
   part=Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,
   part=Je pense oui

Thanks, but I just want to return those with an fmc value, not None. — kely789456123, Feb 13 '20 at 18:59
Sorry about that. That's a one-line change. I have made an update. — Booboo, Feb 13 '20 at 19:08

captnswing · Answer 6 · 2020-02-13T19:13:49.453

Why don't you use lxml to parse your XML? IMHO it's much easier to let lxml parse the xml and navigate the resulting element tree using e.g. XPath to find the things you want.

# install lxml
pip3 install lxml

# xml snippet
xml = """\
<?xml version = "1.0" encoding="UTF-8" standalone="yes" ?>
<corpus>
    <ver id="18" etude="EC1_Elec" elec="oui" niveau="1" critere="1.3" type="discours">
        <part code="EC1_Elec_IW04_0">Ça existe sur des gros parcs Hlm mais c'est macro.</part>
    </ver>
    <ver id="30" etude="EC1_Elec" elec="oui" niveau="2" critere="" origine="IW" type="discours" fmc="motives">
        <part code="EC1_Elec_IW01_0">Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.</part>
    </ver>
    <ver id="54" etude="EC1_Elec" elec="oui" niveau="1" critere="" origine="IW" type="discours" fmc="condition">
        <part code="EC1_Elec_IW10_0">Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.</part>
    </ver>
    <ver id="897" etude="EC3_Elec" elec="oui" niveau="4" critere="4.1" origine="TR" type="discours" fmc="obstacle">
        <part code="EC3_Elec_TR2_1">Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,</part>
        <iwer>Çava influencer la demande pour ce type de solution c'est ça ?</iwer>
        <part code="EC3_Elec_TR2_1">Je pense oui</part>
    </ver>
</corpus>
"""

Then this is the code that does the trick!

# use lxml...
from lxml import etree
# parse the xml snippet into an object tree
tree = etree.fromstring(bytes(xml, encoding='utf-8'))
# find all elements that have "fmc" attribute
for e in tree.findall(".//*[@fmc]"):
    label = e.xpath("./@fmc")[0]
    for c in e.findall("./part"):
        # print value of "fmc" attribute and text of all <part> elements
        print(f"{label:15}{c.text}")

Output:

motives        Avant 75 on n'a pas isolé puis après, au fur et à mesure des règlementations.
condition      Le deuxième boitier, il est où ? s'il y en a un qui est à l'intérieur et qui remplace un bout de l'isolation, il est caché OK.
obstacle       Avec l'économie d'énergie, on va imposer de plus en plus d'automatismes,
obstacle       Je pense oui
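A note on the `TypeError: encoding without a string argument` from the title: `bytes(xml, encoding='utf-8')` only accepts an encoding when `xml` is a `str`. If `xml` is already `bytes` (for example, read from a file opened in binary mode), the call raises exactly that error. Below is a minimal sketch of a guard, assuming that is what happened in the asker's pipeline. The helper name `to_xml_bytes` is hypothetical, and the standard-library parser is used only to keep the example self-contained; `lxml.etree.fromstring` accepts bytes as well.

```python
import xml.etree.ElementTree as ET


def to_xml_bytes(xml):
    """Return the XML as bytes, encoding only when the input is a str."""
    if isinstance(xml, str):
        return xml.encode("utf-8")
    return bytes(xml)


# Hypothetical input that is already bytes, e.g. from open(path, "rb").read()
raw = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<corpus><ver fmc="obstacle"><part>Je pense oui</part></ver></corpus>')

# Reproduce the error: an encoding argument is only valid for a str source
try:
    bytes(raw, encoding="utf-8")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# With the guard, str and bytes inputs both parse
root = ET.fromstring(to_xml_bytes(raw))
labels = [ver.attrib["fmc"] for ver in root.findall("ver[@fmc]")]
print(labels)
```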

Thank you for your suggestion, but I do not want text in tag. — kely789456123, Feb 13 '20 at 19:08
Hello, what does `{label:15}` correspond to? Can I rewrite it in a function as `return label:15`, or would that be wrong? — kely789456123, Feb 13 '20 at 19:41
It just pads spaces to the end, up to 15 chars. You can just `print(label, c.text)`. — captnswing, Feb 13 '20 at 21:49
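To expand on that comment: `:15` is a format specification that only has meaning inside an f-string or `str.format`, so a bare `return label:15` would be a syntax error. A small sketch of what does work is below; `format_row` is a hypothetical helper name.

```python
label = "obstacle"

# Inside an f-string, ":15" left-aligns the string in a field 15 characters wide
padded = f"{label:15}"
print(repr(padded))


def format_row(label, text):
    """Return one output line with the label padded to 15 characters."""
    return f"{label:15}{text}"


print(format_row("motives", "Je pense oui"))
```

For strings this padding is equivalent to `label.ljust(15)`. If the padded layout is not needed, returning the plain `(label, text)` tuple is simpler.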
