Converting a weirdly formatted XML file to CSV using python

Question

I have a weird XML document that contains phone number details. I need to export it to a CSV document, but the problem is that it's not formatted correctly. All of the elements are inside <string> tags, and some "Name" fields are repeated, though not in exactly the same way (as in the example below, most repeated lines contain extra spaces or commas). All the "Numbers" are indented relative to the "Name" fields.

        <string>example1</string>
            <string>014584111</string>

        <string>example2</string>
            <string>04561212123</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example4</string>
            <string>564513212</string>
        
        <string>example3, </string>
        <string>example4  </string>

How can I convert this into CSV format without the repeated content using Python? Here's an example of the output:

FullName  PhoneNumber

example1  014584111
example2  04561212123
example3  +1 156151561
example4  564513212

If `example3, ` comes before `example3 +1 156151561`, what is the output? How do you identify the duplicates if there is no specific pattern? – deadshot, Sep 01 '20 at 11:18

bitranox · Answer 1 · 2020-09-02T13:48:08.237

Of course this can be done. If you can describe the process in plain language, you can also program it.

Example:

read the file (line by line? or does the file fit into memory?)
strip off <string> and </string>
- is the line indented? --> No --> it is a key
- is the line indented? --> Yes --> it is a value for the last key
add the results to a dict
write the dict to a .csv file
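The steps above could be sketched roughly as follows. This is only one possible reading of them: the names `STRING_LINE`, `normalize_name`, `parse_phone_lines` and `write_csv` are made up for illustration, "indented" is taken to mean "indented deeper than the most recent name line", and the cleanup rule (trim spaces and trailing commas, first number wins) is an assumption about what counts as a duplicate.

```python
import csv
import io
import re

# Captures the leading whitespace and the text between <string> tags.
STRING_LINE = re.compile(r"^(\s*)<string>(.*?)</string>\s*$")

SAMPLE = """\
        <string>example1</string>
            <string>014584111</string>

        <string>example2</string>
            <string>04561212123</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example4</string>
            <string>564513212</string>

        <string>example3, </string>
        <string>example4  </string>
"""


def normalize_name(raw):
    # Assumed cleanup rule: trim whitespace and trailing commas.
    return raw.strip().rstrip(",").strip()


def parse_phone_lines(lines):
    """Map each name to the first indented value that follows it."""
    phones = {}  # dicts keep insertion order, so the CSV keeps file order
    name, name_indent = None, 0
    for line in lines:
        match = STRING_LINE.match(line)
        if match is None:
            continue  # blank line or something that is not a <string> line
        indent, text = len(match.group(1)), match.group(2)
        if name is not None and indent > name_indent:
            # Deeper than the last name line: a value for that name.
            phones.setdefault(name, text.strip())
        else:
            # Same or shallower indentation: a new name (key).
            name, name_indent = normalize_name(text), indent
    return phones


def write_csv(phones, out):
    writer = csv.writer(out)
    writer.writerow(["FullName", "PhoneNumber"])
    writer.writerows(phones.items())


phones = parse_phone_lines(SAMPLE.splitlines())
buffer = io.StringIO()
write_csv(phones, buffer)
print(buffer.getvalue())
```

Because `parse_phone_lines` accepts any iterable of lines, an open file object can be passed instead of `SAMPLE.splitlines()`, so the input never has to be held in memory as a whole. Names that never get a number, such as the two trailing lines in the sample, simply never reach the dict.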

So you now need to make some decisions, such as:

Is the input file huge? Then it will probably not fit into memory, and we need to process it line by line. Or will it fit into memory?

Will this program be needed many times, or is it just a one-time conversion?

Then you can divide the problem into smaller sub-problems and write some tests for each sub-problem.
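As an illustration of testing one sub-problem in isolation, here is a small sketch that covers only the "split a line into indentation and text" step. The helper `split_line` and the test names are hypothetical, and the test cases are just the ones that seemed relevant to the sample data.

```python
import re
import unittest

STRING_LINE = re.compile(r"^(\s*)<string>(.*?)</string>\s*$")


def split_line(line):
    """Return (indent_width, text) for a <string> line, else None."""
    match = STRING_LINE.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


class SplitLineTests(unittest.TestCase):
    def test_name_line_keeps_indent_and_text(self):
        self.assertEqual(
            split_line("    <string>example1</string>"), (4, "example1")
        )

    def test_number_line_is_indented_deeper(self):
        name = split_line("    <string>example1</string>")
        number = split_line("        <string>014584111</string>")
        self.assertGreater(number[0], name[0])

    def test_blank_line_is_skipped(self):
        self.assertIsNone(split_line("   "))

    def test_text_is_not_trimmed(self):
        # Trimming is a separate sub-problem, so the raw text is preserved.
        self.assertEqual(
            split_line("<string>example3, </string>"), (0, "example3, ")
        )


if __name__ == "__main__":
    unittest.main()
```

Keeping the helper this small means the rules for cleaning names or detecting duplicates can change later without touching these tests.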

You also need to consider further circumstances, such as the file size, whether it is a one-time script, and whether there should be error checking (what if there are two indented lines?).
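If error checking is wanted, one option is a separate pass that only reports irregular records before any conversion happens. The sketch below is an assumption about what "irregular" could mean here (a name with no number, more than one indented line under a name, or a non-empty line that is not a `<string>` line); `find_problems` is a made-up name.

```python
import re

STRING_LINE = re.compile(r"^(\s*)<string>(.*?)</string>\s*$")


def find_problems(lines):
    """Return (line_number, message) pairs for records that look irregular."""
    problems = []
    name_line, name_indent, numbers_seen = None, 0, 0
    for line_number, line in enumerate(lines, start=1):
        match = STRING_LINE.match(line)
        if match is None:
            if line.strip():
                problems.append((line_number, "not a <string> line"))
            continue
        indent = len(match.group(1))
        if name_line is not None and indent > name_indent:
            # An indented line belongs to the current name.
            numbers_seen += 1
            if numbers_seen > 1:
                problems.append(
                    (line_number, "extra indented line for one name")
                )
        else:
            # A new name starts; check whether the previous one got a number.
            if name_line is not None and numbers_seen == 0:
                problems.append((name_line, "name without a number"))
            name_line, name_indent, numbers_seen = line_number, indent, 0
    if name_line is not None and numbers_seen == 0:
        problems.append((name_line, "name without a number"))
    return problems


example = [
    "<string>a</string>",
    "    <string>1</string>",
    "    <string>2</string>",
    "<string>b</string>",
    "<string>c</string>",
]
for line_number, message in find_problems(example):
    print(line_number, message)
```

Whether such findings should abort the conversion or only be logged depends on how the script is going to be used.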

balderman · Answer 2 · 2020-09-02T13:10:33.093


See below (do what you need to do with data):

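import xml.etree.ElementTree as ET

def is_phone_number(value):
    # Accept only digits, spaces and '+'; anything else is treated as a name.
    for x in value:
        if x != '+' and x != ' ' and not x.isnumeric():
            return False
    return True

# The sample from the question, wrapped in <r>...</r> so it has a single root.
xml = '''<r> <string>example1</string>
            <string>014584111</string>

        <string>example2</string>
            <string>04561212123</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example4</string>
            <string>564513212</string>

        <string>example3, </string>
        <string>example4  </string></r>'''
data = []
root = ET.fromstring(xml)
strings = root.findall('.//string')
# Walk the elements in (name, number) pairs. Stopping at len - 1 keeps
# strings[i+1] in range when the number of elements is odd.
i = 0
while i + 1 < len(strings):
    # Keep the pair only if the second element looks like a phone number.
    if is_phone_number(strings[i+1].text):
        data.append({'key': strings[i].text,'value':strings[i+1].text})
    i += 2

print(data)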

output

[{'key': 'example1', 'value': '014584111'}, {'key': 'example2', 'value': '04561212123'}, {'key': 'example3', 'value': '+1 156151561'}, {'key': 'example4', 'value': '564513212'}]
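To get from that list of dicts to the CSV the question asks for, something along these lines might work. It is a sketch rather than part of the answer's code: `dedupe` and `to_csv` are hypothetical helpers, and the duplicate rule (trim spaces and trailing commas, keep the first number seen) is an assumption.

```python
import csv
import io

# Copied from the output above.
data = [
    {'key': 'example1', 'value': '014584111'},
    {'key': 'example2', 'value': '04561212123'},
    {'key': 'example3', 'value': '+1 156151561'},
    {'key': 'example4', 'value': '564513212'},
]


def dedupe(records):
    """Drop records whose cleaned-up name was already seen."""
    seen = set()
    unique = []
    for record in records:
        # Assumed cleanup rule: trim whitespace and trailing commas.
        name = record['key'].strip().rstrip(',').strip()
        if name in seen:
            continue
        seen.add(name)
        unique.append(
            {'FullName': name, 'PhoneNumber': record['value'].strip()}
        )
    return unique


def to_csv(records, out):
    writer = csv.DictWriter(out, fieldnames=['FullName', 'PhoneNumber'])
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
to_csv(dedupe(data), buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with `newline=''` (as the `csv` module documentation recommends) in place of `buffer`.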

edited Sep 02 '20 at 13:10

answered Sep 01 '20 at 11:47


@bitranox look at the XML input (last record) and see that the result is correct. Feel free to up vote this answer :-) – balderman Sep 01 '20 at 14:27
Look at the data: it is malformed XML, and the values are indented. As stated in the question, all the "Numbers" are indented from the "Name" fields. With a normal XML parser that information is lost, so you need to parse it manually. The last two lines should not be in the results, since they don't have numbers assigned (this is shown in the result set of the question). Therefore your answer is unfortunately wrong ... – bitranox Sep 02 '20 at 12:52
@bitranox My code just loads the XML into a list of dicts. From that point the OP can add logic that will solve this problem. All I did to the XML input was to add `<r>` and `</r>`. I did not change anything else! – balderman Sep 02 '20 at 12:56
Again, your transformation ignores the indentation: numbers ARE indented in that dataset. That information is useful for further parsing, and you ignore it, which is just not a good idea! Besides that, you really don't need an XML parser to strip those markers. – bitranox Sep 02 '20 at 13:02
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220865/discussion-between-bitranox-and-balderman). – bitranox Sep 02 '20 at 13:39
