Parsing colon delimited data

Question

I have the following text chunk:

string = """
    apples: 20
    oranges: 30
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
            farmer bill
                   lives far
    selling: yes
    veggies:
            carrots
            potatoes
    """

I am trying to find a good regex that will let me parse out the key/value pairs. I can grab the single-line ones with something like:

'(.+?):\s(.+?)\n'

However, the problem comes when I hit farmers, or veggies.

Using the re flags, I need to do something like:

re.findall('(.+?):\s(.+?)\n', string, re.S)

However, I am having a heck of a time grabbing all of the values associated with farmers.

There is a newline after each value, and a tab, or series of tabs before the values when they are multiline.

My goal is to end up with something like:

{ 'apples': 20, 'farmers': ['elmer fudd', 'farmer ted'] }

etc.

Thank you in advance for your help.

Is the 'lives in tv' part significant? You didn't mention it in your desired output. — Burhan Khalid, Oct 15 '13 at 22:51
How about this approach: split by newlines store as `x`, step through each line, and split it by `':'`. If the second part is not empty, then add the two pairs as key and value to your dictionary, and pop the line from `x`; next you'll be left with a list of only keys (with ':') and everything else goes in a list for that key. Run through the trimmed `x` and add the remaining to the dictionary. — Burhan Khalid, Oct 15 '13 at 23:13
What's the rule for why `"lives in tv"` doesn't end up in the list? Or `"farmer bill"`, for that matter? — abarnert, Oct 15 '13 at 23:32
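
The line-splitting idea from the comments above could be sketched roughly as follows. This is only an illustration: the helper name `parse_by_lines` and the small `sample` text are made up here, and the sketch deliberately keeps every continuation line (including nested ones such as "lives in tv"), since the rule for dropping them is exactly what is being asked about.

```python
def parse_by_lines(text):
    """Hypothetical helper: walk the lines once, splitting each on ':'."""
    result = {}
    current = None  # list being filled for the most recent bare "key:" line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            # "key: value" on a single line
            result[key.strip()] = value.strip()
            current = None
        elif sep:
            # bare "key:" starts a multi-line value
            current = result.setdefault(key.strip(), [])
        elif current is not None:
            # continuation line; nested lines are kept as well
            current.append(line)
    return result


sample = "apples: 20\nfarmers:\n\telmer fudd\n\t\tlives in tv\n\tfarmer ted\nselling: yes\n"
print(parse_by_lines(sample))
```

Values come back as strings; converting '20' to an integer would be a separate step.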

score 2 · Answer 1 · answered Oct 15 '13 at 22:46

You might look at PyYAML, this text is very close to, if not actually valid YAML.

Wayne Werner

It's close, but I believe that `farmers` will end up one long string - it's not quite a list... – Jon Clements Oct 15 '13 at 22:50
If I can grab the values, I could split by newline and construct the list. However, I'm still trying to figure out how best to grab the values. – fr00z1 Oct 15 '13 at 23:01
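
One possible way to do the "grab the values, then split by newline" step with a regex is sketched below. The pattern, the helper name `grab_blocks`, and the `sample` text are assumptions for illustration, not something from the thread. The sketch treats any line without a colon as a continuation of the previous key, and keeps only continuation lines indented like the first one, which is a guess at the intended rule.

```python
import re

# Group 1: the key. Group 2: the rest of the key's own line.
# Group 3: every following line that contains no colon.
BLOCK = re.compile(r"^[ \t]*([^:\n]+):[ \t]*(.*)((?:\n(?![^:\n]*:).*)*)", re.M)


def indent_of(line):
    """Number of leading whitespace characters."""
    return len(line) - len(line.lstrip())


def grab_blocks(text):
    result = {}
    for key, inline, block in BLOCK.findall(text):
        key = key.strip()
        if inline.strip():
            # single-line "key: value"
            result[key] = inline.strip()
            continue
        lines = [ln for ln in block.split("\n") if ln.strip()]
        # keep only lines indented like the first continuation line
        first = indent_of(lines[0]) if lines else 0
        result[key] = [ln.strip() for ln in lines if indent_of(ln) == first]
    return result


sample = (
    "apples: 20\n"
    "farmers:\n"
    "    elmer fudd\n"
    "        lives in tv\n"
    "    farmer ted\n"
    "        lives close\n"
    "selling: yes\n"
    "veggies:\n"
    "    carrots\n"
    "    potatoes\n"
)
print(grab_blocks(sample))
```

The regex only finds the blocks; the indentation filtering still happens in plain Python, which may be a hint that a regex alone is not the most natural tool here.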

score 1 · Answer 2 · answered Oct 15 '13 at 23:16

Here's a totally silly way to do it:

import collections


string = """
    apples: 20
    oranges: 30
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
                   lives close
            farmer bill
                   lives far
    selling: yes
    veggies:
            carrots
            potatoes
    """


def funky_parse(inval):
    lines = inval.split("\n")
    items = collections.defaultdict(list)
    at_val = False
    key = ''
    val = ''
    last_indent = 0
    for j, line in enumerate(lines):
        indent = len(line) - len(line.lstrip())
        if j != 0 and at_val and indent > last_indent > 4:
            continue
        if j != 0 and ":" in line:
            if val:
                items[key].append(val.strip())
            at_val = False
            key = ''
        line = line.lstrip()
        for i, c in enumerate(line, 1):
            if at_val:
                val += c
            else:
                key += c
            if c == ':':
                at_val = True
            if i == len(line) and at_val and val:
                items[key].append(val.strip())
                val = ''
        last_indent = indent

    return items

print(dict(funky_parse(string)))

OUTPUT

{'farmers:': ['elmer fudd', 'farmer ted', 'farmer bill'], 'apples:': ['20'], 'veggies:': ['carrots', 'potatoes'], 'ripe:': ['yes'], 'oranges:': ['30'], 'selling:': ['yes']}

score 1 · Accepted Answer · answered Oct 15 '13 at 23:37

Here's a really dumb parser that takes into account your (apparent) indentation rules:

def parse(s):
    d = {}
    lastkey, lastval, lastindent = None, [], None
    for fullline in s:
        line = fullline.strip()
        if not line:
            pass  # skip blank lines
        elif ':' not in line:
            # Continuation line: keep it only if it is indented like the
            # first item under the current key (this drops nested lines).
            indent = len(fullline) - len(fullline.lstrip())
            if lastindent is None:
                lastindent = indent
            if lastindent == indent:
                lastval.append(line)
        else:
            # A new key: flush any list collected for the previous key.
            if lastkey:
                d[lastkey] = lastval
                lastkey = None
            if line.endswith(':'):
                # A bare "key:" starts a multi-line list.
                lastkey, lastval, lastindent = line[:-1], [], None
            else:
                key, _, value = line.partition(':')
                d[key] = value.strip()
    if lastkey:
        d[lastkey] = lastval
    return d

from pprint import pprint
pprint(parse(string.splitlines()))

The output is:

{'apples': '20',
 'farmers': ['elmer fudd', 'farmer ted', 'farmer bill'],
 'oranges': '30',
 'ripe': 'yes',
 'selling': 'yes',
 'veggies': ['carrots', 'potatoes']}

I think this is already complicated enough that it would look cleaner as an explicit state machine, but I wanted to write this in terms that any novice could understand.
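
For comparison, here is a rough sketch of what a more explicit state machine might look like. The `State` enum, the `parse_fsm` name, and the `sample` text are invented for illustration, and the sketch assumes the same apparent indentation rules: list items share the indent of the first item, and deeper lines are ignored.

```python
from enum import Enum


class State(Enum):
    TOP = 1      # expecting "key: value" or a bare "key:" line
    IN_LIST = 2  # collecting items under a bare "key:"


def parse_fsm(lines):
    d = {}
    state = State.TOP
    items, item_indent = None, None
    for full in lines:
        line = full.strip()
        if not line:
            continue
        if ":" in line:
            # Any key line ends the list being collected, if there is one.
            key, _, value = line.partition(":")
            if value.strip():
                d[key.strip()] = value.strip()
                state = State.TOP
            else:
                items, item_indent = [], None
                d[key.strip()] = items
                state = State.IN_LIST
        elif state is State.IN_LIST:
            indent = len(full) - len(full.lstrip())
            if item_indent is None:
                item_indent = indent  # first item fixes the list indent
            if indent == item_indent:
                items.append(line)
        # A colon-less line in the TOP state has no key; it is ignored.
    return d


sample = """
    ripe: yes
    farmers:
            elmer fudd
                   lives in tv
            farmer ted
    selling: yes
"""
print(parse_fsm(sample.splitlines()))
```

Whether this reads more clearly than the version above is a matter of taste; the main difference is that the "am I inside a list?" question is answered by one named state instead of by several loosely related variables.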

Thank you, this is a very clean solution. I was initially trying to solve this with a regex, but maybe the regex isn't worth the effort, and just incurs more complexity. — fr00z1, Oct 16 '13 at 00:10
@user2152283: Whenever I can't figure out how to do something with regexp (even if I'm sure it's a regular language I'm trying to parse), I step back and try to write it another way. Sometimes that lets me figure out the regexp subconsciously; sometimes it means I end up with a non-regexp-based but readable parser; sometimes I end up proving to myself that the language is non-regular or even context-sensitive and I'm going to need something more complicated… but no matter what, it's a win. — abarnert, Oct 16 '13 at 00:26
