How to parse JSON-XML hybrid file in Python

Question

I would like to parse a file with the following syntax (but with arbitrarily deep nesting) in Python:

<XProtocol>
{
    <str1."fds"> "str2"
    <str3> 123.0
    <str4> { 1 2 3 4 5 6 6 "str" "str" 43 "str" 4543 }
    <weird1."str5">
    {
        <weird2."str6"> { "str" }
        <also."weird3"> 1
        <againweird> { 1 "fds" }
        { }
        <even> <more."weird4"> { } { } { } { "a" }
    }
}

The desired output would be something like:

'XProtocol':
{
    'str1."fds"': 'str2',
    'str3': 123.0,
    'str4': (1, 2, 3, 4, 5, 6, 6, 'str', 'str', 43, 'str', 4543),
    'weird1."str5"':
    {
        'weird2."str6"': ( 'str' ),
        'also."weird3"': 1,
        'againweird': ((1, 'fds'), None),
        'even': { 'more."weird4"': (None, None, None, 'a') },
    }
}

I have unsuccessfully tried using the following code:

import pyparsing as pp

def parse_x_prot(text):        
    lbra = pp.Literal('{').suppress()
    rbra = pp.Literal('}').suppress()
    lang = pp.Literal('<').suppress()
    rang = pp.Literal('>').suppress()
    dot = pp.Literal('.').suppress()
    cstr = pp.quotedString.addParseAction(pp.removeQuotes)
    tag = pp.Group(
        lang +
        pp.Word(pp.alphanums) +
        pp.Optional(pp.Group(dot + cstr)) +
        rang)
    val = pp.OneOrMore(
        cstr | pp.Word(pp.nums + '.')
    )
    exp = pp.Forward()
    exp << pp.OneOrMore(
        pp.Group(
            tag + pp.OneOrMore(
                (lbra + (val | exp) + rbra) |
                (val + exp)
            )
        )
    )
    return exp.parseString(text)

I must be doing something wrong, but I haven't yet figured out exactly what. To be more precise: the code above tells me it expects a '}' instead of a new 'tag'.

Well, it follows similar semantics, except that 'dictionary' names are enclosed in '<' and '>', lists are not comma-separated, and the like. XML would be much further away in comparison – norok2, Feb 18 '16 at 16:05
It has *elements* from a bunch of other serialisation formats, but it's equally distant from JSON as it is from XML I'd say... – deceze, Feb 18 '16 at 16:07
That is a jolly good question, but I do not have a precise answer (I am not the inventor of it; I do not want to get more credit than I deserve, and I fail to see why this format was invented in the first place). Anyway, I would assume it is like `{ { } }`, except that the first pair of brackets is redundant (!?) – norok2, Feb 18 '16 at 16:14
I have edited the title, so that readers expect less JSON-like syntax – norok2, Feb 18 '16 at 16:18
It would probably be a useful first step to understand the data format you're trying to parse before you try to parse it... :o) – deceze, Feb 18 '16 at 16:21
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103870/discussion-between-norok2-and-deceze). – norok2, Feb 18 '16 at 16:35

score 3 · Answer 1 · answered Feb 18 '16 at 17:18

A couple of things:

In your definition of tag, you wrap it in a Group, but I think you really want to use Combine.

Second, the nesting in your exp mixes up the repetition with the recursion.

This works for me (also, take off the .suppress() on dot):

tag = pp.Combine(
    lang +
    pp.Word(pp.alphas, pp.alphanums) +
    pp.Optional(dot + cstr) +
    rang).setName("tag")

exp = pp.Forward()
key_value = pp.Group(tag + exp)
number = pp.Regex(r'[+-]?\d+(\.\d*)?').setName("number")
exp <<= (number |
            cstr |
            key_value |
            pp.Group(lbra + pp.ZeroOrMore(exp) + rbra))

Giving:

['XProtocol', [['str1.fds', 'str2'], ['str3', '123.0'], ...
[0]:
  XProtocol
[1]:
  [['str1.fds', 'str2'], ['str3', '123.0'], ['str4', ['1', '2', '3',...
  [0]:
    ['str1.fds', 'str2']
  [1]:
    ['str3', '123.0']
  [2]:
    ['str4', ['1', '2', '3', '4', '5', '6', '6', 'str', 'str', '43', ...
    [0]:
      str4
    [1]:
      ['1', '2', '3', '4', '5', '6', '6', 'str', 'str', '43', ...
  [3]:
    ['weird1.str5', [['weird2.str6', ['str']], ['also.weird3', ...
    [0]:
      weird1.str5
    [1]:
      [['weird2.str6', ['str']], ['also.weird3', '1'], ['againweird', ...
      [0]:
        ['weird2.str6', ['str']]
        [0]:
          weird2.str6
        [1]:
          ['str']
      [1]:
        ['also.weird3', '1']
      [2]:
        ['againweird', ['1', 'fds']]
        [0]:
          againweird
        [1]:
          ['1', 'fds']
      [3]:
        []
      [4]:
        ['even', ['more.weird4', []]]
        [0]:
          even
        [1]:
          ['more.weird4', []]
          [0]:
            more.weird4
          [1]:
            []
      [5]:
        []
      [6]:
        []
      [7]:
        ['a']

Thank you very much for the hints; they helped me a lot, and your code actually works on most real-world scenarios. However, it still does not seem to produce the correct results (e.g. the empty items at the end of the output should all be at the same level as `more.weird4`). I will play around with it and see what happens. – norok2, Feb 18 '16 at 19:42
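
One possible way to get the placement described in the comment above is to let a tag take either one nested tag or a run of one or more values, so that trailing `{ }` blocks attach to the preceding tag. The sketch below rests on that rule, which is an assumption inferred from the question's desired output and not a confirmed description of the format. The names `to_number`, `value`, `block` and `SAMPLE` are illustrative.

```python
import pyparsing as pp

lbra, rbra, lang, rang = (pp.Literal(c).suppress() for c in '{}<>')
dot = pp.Literal('.')
# copy() so the parse action is not attached to the shared pp.quotedString
cstr = pp.quotedString.copy().setParseAction(pp.removeQuotes)


def to_number(tokens):
    """Convert the matched text to float if it has a decimal point, else int."""
    text = tokens[0]
    return float(text) if '.' in text else int(text)


number = pp.Regex(r'[+-]?\d+(\.\d*)?').setName("number").setParseAction(to_number)
tag = pp.Combine(
    lang + pp.Word(pp.alphas, pp.alphanums) + pp.Optional(dot + cstr) + rang
).setName("tag")

block = pp.Forward()
key_value = pp.Forward()
value = number | cstr | block
# Assumed rule: a tag is followed by one nested tag or by one or more values
key_value <<= pp.Group(tag + (key_value | pp.OneOrMore(value)))
# A braced block holds any mix of tagged entries and plain values
block <<= pp.Group(lbra + pp.ZeroOrMore(key_value | value) + rbra)

SAMPLE = '''
<weird1."str5">
{
    <also."weird3"> 1
    <againweird> { 1 "fds" }
    { }
    <even> <more."weird4"> { } { } { } { "a" }
}
'''
result = key_value.parseString(SAMPLE, parseAll=True).asList()
print(result)
```

With this rule, `againweird` receives both `{ 1 "fds" }` and the `{ }` that follows it, and the four blocks after `<more."weird4">` stay inside `even`. Whether that grouping matches real files is something to check against actual data.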

score 3 · Answer 2 · answered Feb 18 '16 at 17:25

I changed a few things in your code to make it work correctly; the comments indicate what went wrong.

def parse_x_prot(text):
    # Set up some shortcuts
    lbra = pp.Literal('{').suppress()
    rbra = pp.Literal('}').suppress()
    lang = pp.Literal('<').suppress()
    rang = pp.Literal('>').suppress()
    dot = pp.Literal('.')
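    # copy() keeps the parse action off the shared module-level pp.quotedString,
    # so calling this function more than once does not strip characters twice
    cstr = pp.quotedString.copy().addParseAction(pp.removeQuotes)

    # Define what a correct tag looks like (we use Combine here to get the full tag in the output)
    tag = pp.Combine(
        lang +
        pp.Word(pp.alphanums) +
        pp.Optional(pp.Group(dot + cstr)) +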
        rang)

    # Define legal value (first combine is for decimal values)
    val = pp.Combine(pp.Word(pp.nums) + dot + pp.Word(pp.nums)) | cstr | pp.Word(pp.nums)

    # Define the array with statement as recursion element
    statement = pp.Forward()
    array = pp.Group(pp.OneOrMore(tag) +
                     pp.OneOrMore(
                         (
                             # Note the OneOrMore here: a braced block is a
                             # kind of list that holds one or more elements
                             (lbra + pp.OneOrMore(val | statement) + rbra) |
                             val |
                             (lbra + rbra)
                         )
                     )
                     )

    statement << array
    return statement.parseString(text)

Thank you for the comments, they were very helpful. Unfortunately, the above code does not seem to work with more complicated examples. I will come back to it as soon as I get closer to understanding why and/or can produce a simple test case – norok2, Feb 18 '16 at 17:52
Just let me know; it would be interesting to see an even weirder form of this hybrid :) – B8vrede, Feb 18 '16 at 18:13

score 0 · Answer 3 · edited Nov 20 '17 at 15:51

This may not be the answer you want, but I think Flex would help you greatly with this kind of task. There might even be a Python wrapper for it.

answered Feb 18 '16 at 17:00 by Alex B. · edited Nov 20 '17 at 15:51 by rici

I would not dispute that there is a LOT out there for text parsing, some of it extremely performant, but when it comes to ease of use and integration with Python, I have found `pyparsing` to be an excellent choice. – norok2, Feb 18 '16 at 18:26
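
The lexer-based route suggested in this answer can also be approximated without external tools. The following is a minimal sketch in standard-library Python only (it does not use Flex or pyparsing): a regular-expression tokenizer feeds a small recursive-descent parser. The grammar rules are assumptions inferred from the question's example: a tag is followed by one nested tag or by one or more values, `{ }` becomes `None`, and strings contain no escaped quotes. `TOKEN_RE`, `tokenize` and `parse_xprotocol` are hypothetical names.

```python
import re

# One alternative per token kind; the group name becomes the token kind
TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<tag><\w+(?:\."[^"]*")?>)    # <name> or <name."text">
      | (?P<str>"[^"]*")                # double-quoted string (no escapes)
      | (?P<num>[+-]?\d+(?:\.\d*)?)     # integer or decimal number
      | (?P<lbra>\{)
      | (?P<rbra>\})
    )''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, lexeme) pairs, raising ValueError on unrecognized input."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        yield match.lastgroup, match.group(match.lastgroup)


def parse_xprotocol(text):
    """Parse text into nested dicts, tuples, scalars and None (for '{ }')."""
    tokens = list(tokenize(text))
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def value():
        kind, lexeme = take()
        if kind == 'num':
            return float(lexeme) if '.' in lexeme else int(lexeme)
        if kind == 'str':
            return lexeme[1:-1]
        return block()  # kind == 'lbra', guaranteed by the callers

    def block():
        # Called right after '{' has been consumed
        entries, values = {}, []
        while peek() != 'rbra':
            if peek() is None:
                raise ValueError("missing '}'")
            if peek() == 'tag':
                key, val = pair()
                entries[key] = val
            else:
                values.append(value())
        take()  # consume '}'
        if entries and values:
            raise ValueError("blocks mixing tags and bare values are not handled")
        if entries:
            return entries
        return tuple(values) if values else None

    def pair():
        key = take()[1][1:-1]  # drop '<' and '>', keep the inner quotes
        if peek() == 'tag':    # assumed rule: a tag may wrap one nested tag
            inner_key, inner_val = pair()
            return key, {inner_key: inner_val}
        values = []
        while peek() in ('num', 'str', 'lbra'):
            values.append(value())
        if not values:
            raise ValueError(f"tag <{key}> has no value")
        return key, (values[0] if len(values) == 1 else tuple(values))

    if peek() != 'tag':
        raise ValueError("expected a tag at the start of the input")
    key, val = pair()
    if pos != len(tokens):
        raise ValueError("trailing input after the top-level entry")
    return {key: val}


sample = '<a> { <b."c"> "x" <d> 1.5 <e> { 1 "y" } { } }'
print(parse_xprotocol(sample))
```

In this sketch a block with a single element stays a one-element tuple such as `('str',)`, whereas the question's listing writes `( 'str' )`; which of the two is wanted depends on how the data is consumed afterwards.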
