9

I know there are a lot of other posts about parsing comma-separated values, but I couldn't find one that splits key-value pairs and handles quoted commas.

I have strings like this:

age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"

And I want to get this:

{
  'age': '12',
  'name': 'bob',
  'hobbies': 'games,reading',
  'phrase': "I'm cool!",
}

I tried using shlex like this:

import shlex
lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''')
lexer.whitespace_split = True
lexer.whitespace = ','
props = dict(pair.split('=', 1) for pair in lexer)

The trouble is that shlex splits the hobbies entry into two tokens, i.e. `hobbies="games` and `reading"`. Is there a way to make it take the double quotes into account? Or is there another module I can use?

EDIT: Fixed typo for whitespace_split

EDIT 2: I'm not tied to using shlex. Regex is fine too, but I didn't know how to handle the matching quotes.

Addison
  • A better strategy could be like this: First split by equal signs and then split by the last comma in each string. – Seçkin Savaşçı Dec 20 '14 at 00:19
  • The easiest way to do that is to use a regexp. – poxip Dec 20 '14 at 00:22
  • @SeçkinSavaşçı: Unless there are equal signs within quotation marks... – Scott Hunter Dec 20 '14 at 00:24
  • If you insist on not using a regexp, and double quotes will always be used for strings and never appear within them, you could split the string on `"` so you can identify the quoted strings, and work around them. – Scott Hunter Dec 20 '14 at 00:27
  • @ScottHunter completely agree with you, that's why my suggestion is not qualified as a valid answer. – Seçkin Savaşçı Dec 20 '14 at 00:27
  • The post on splitting on semi-colons doesn't seem to address the type of quoting I have. – Addison Dec 20 '14 at 00:36
  • @ScottHunter I think the `csv` module could be used with a `=` delimiter, and this will protect against splitting on `=` within quotes. – jme Dec 20 '14 at 01:04
  • @Alex Martelli: It is *not* a duplicate. The adaptation of the accepted answer from [the linked question](http://stackoverflow.com/q/186857/4279) *does not* work for the input from the current question: `[next(csv.reader([item], delimiter='=')) for item in next(csv.reader([s]))]` (it fails to escape the comma in `"games,reading"`). The questions are similar but this question *requires* such quoting support. – jfs Dec 21 '14 at 20:19

5 Answers

11

You just needed to use your shlex lexer in POSIX mode.

Add posix=True when creating the lexer.

(See the shlex parsing rules)

import shlex
lexer = shlex.shlex('''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"''', posix=True)
lexer.whitespace_split = True
lexer.whitespace = ','
props = dict(pair.split('=', 1) for pair in lexer)

Output:

{'age': '12', 'phrase': "I'm cool!", 'hobbies': 'games,reading', 'name': 'bob'}

PS: A regular expression becomes fragile once the input can contain quoted = or , characters, and more so if quotes can be escaped inside values: the pattern then has to track which quote opened the value and skip escaped quotes. shlex already implements these rules.
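To make the difference between the two modes concrete, here is a small sketch that runs the same lexer settings with and without POSIX mode and prints the resulting tokens. The helper name `comma_tokens` is made up for illustration; everything else is the standard `shlex` API.

```python
import shlex

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

def comma_tokens(text, posix):
    # Hypothetical helper: split on commas only, in the requested mode.
    lexer = shlex.shlex(text, posix=posix)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    return list(lexer)

non_posix_tokens = comma_tokens(text, posix=False)
posix_tokens = comma_tokens(text, posix=True)

print(non_posix_tokens)  # quotes inside a word are kept as ordinary characters
print(posix_tokens)      # quotes are recognized inside words and stripped
```

In non-POSIX mode a quote that appears in the middle of a word does not start a quoted section, so the comma inside `"games,reading"` still splits the entry. In POSIX mode the quoted part is read as a unit and the quotes are removed.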

pistache
  • I think this one is the best, since it handles all the cases and uses the `shlex` library, instead of reinventing the wheel. – Addison Aug 05 '16 at 17:49
5

It's possible to do this with a regular expression. In this case, it might actually be the best option, too. I think this will work with most input, even escaped quotes such as this one: phrase='I\'m cool' (the backslash stays in the captured value).

With the VERBOSE flag, it's possible to make complicated regular expressions quite readable.

import re
text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
regex = re.compile(
    r'''
        (?P<key>\w+)=      # Key: word characters (letters, digits, underscore)
        (?P<quote>["']?)   # Optional quote character.
        (?P<value>.*?)     # Value is a non greedy match
        (?P=quote)         # Closing quote equals the first.
        ($|,)              # Entry ends with comma or end of string
    ''',
    re.VERBOSE
    )

d = {match.group('key'): match.group('value') for match in regex.finditer(text)}

print(d)  # {'name': 'bob', 'phrase': "I'm cool!", 'age': '12', 'hobbies': 'games,reading'}
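If backslash-escaped quotes inside values are a realistic input, one possible variation is to spell out the quoted forms explicitly instead of using a back-reference. This is only a sketch under that assumption: the names `PAIR`, `unescape` and `parse_pairs` are invented for the example, and the rule "a backslash escapes the next character" is an assumption about the input format, not something the question specifies.

```python
import re

PAIR = re.compile(
    r'''
    (?P<key>\w+)=
    (?:
        "(?P<dq>(?:[^"\\]|\\.)*)"    # double-quoted value, escaped chars allowed
      | '(?P<sq>(?:[^'\\]|\\.)*)'    # single-quoted value, escaped chars allowed
      | (?P<bare>[^,]*)              # unquoted value, runs up to the next comma
    )
    (?:,|$)                          # entry ends with comma or end of string
    ''',
    re.VERBOSE,
)

def unescape(value):
    # Assumption: a backslash escapes the character that follows it.
    return re.sub(r'\\(.)', r'\1', value)

def parse_pairs(text):
    result = {}
    for match in PAIR.finditer(text):
        quoted = match.group('dq')
        if quoted is None:
            quoted = match.group('sq')
        if quoted is not None:
            result[match.group('key')] = unescape(quoted)
        else:
            result[match.group('key')] = match.group('bare')
    return result

text = r'''age=12,hobbies="games,reading",phrase="I'm \"cool\", I think..."'''
parsed = parse_pairs(text)
print(parsed)
```

Because the two branches inside each quoted form cannot match the same character, the pattern does not backtrack heavily on long values.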
Håken Lid
  • Nice! This is much cleaner and more understandable than using `shlex`. – Addison Dec 20 '14 at 01:23
  • Yes. If that is an actual possible input, then escaped quotes should be sanitized somehow before running the regex search. Another example, which is almost plausible is: `phrase="I'm \"cool\", I think..."` – Håken Lid Dec 20 '14 at 03:27
  • @HåkenLid: I only bring this up because you've mentioned it in your answer. OP hasn't said anything about escaping quotes inside strings. btw, [FSM-based solution from my answer works in this case](http://ideone.com/wOZ01D) – jfs Dec 20 '14 at 03:36
  • It's something that might be relevant, depending on what kind of input could be expected. – Håken Lid Dec 20 '14 at 03:55
3

You could abuse the Python tokenizer to parse the key-value list:

#!/usr/bin/env python
from tokenize import generate_tokens, NAME, NUMBER, OP, STRING, ENDMARKER

def parse_key_value_list(text):
    key = value = None
    for type, string, _,_,_ in generate_tokens(lambda it=iter([text]): next(it)):
        if type == NAME and key is None:
            key = string
        elif type in {NAME, NUMBER, STRING}:
            value = {
                NAME: lambda x: x,
                NUMBER: int,
                STRING: lambda x: x[1:-1]
            }[type](string)
        elif ((type == OP and string == ',') or
              (type == ENDMARKER and key is not None)):
            yield key, value
            key = value = None

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
print(dict(parse_key_value_list(text)))

Output

{'phrase': "I'm cool!", 'age': 12, 'name': 'bob', 'hobbies': 'games,reading'}

You could use a finite-state machine (FSM) to implement a stricter parser. The parser uses only the current state and the next token to parse input:

#!/usr/bin/env python
from tokenize import generate_tokens, NAME, NEWLINE, NUMBER, OP, STRING, ENDMARKER

def parse_key_value_list(text):
    def check(condition):
        if not condition:
            raise ValueError((state, token))

    KEY, EQ, VALUE, SEP = range(4)
    state = KEY
    for token in generate_tokens(lambda it=iter([text]): next(it)):
        type, string = token[:2]
        if type == NEWLINE:
            # Python 3 adds an implicit NEWLINE before ENDMARKER; skip it.
            continue
        if state == KEY:
            check(type == NAME)
            key = string
            state = EQ
        elif state == EQ:
            check(type == OP and string == '=')
            state = VALUE
        elif state == VALUE:
            check(type in {NAME, NUMBER, STRING})
            value = {
                NAME: lambda x: x,
                NUMBER: int,
                STRING: lambda x: x[1:-1]
            }[type](string)
            state = SEP
        elif state == SEP:
            check(type == OP and string == ',' or type == ENDMARKER)
            yield key, value
            state = KEY

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
print(dict(parse_key_value_list(text)))
jfs
1

Ok, I actually figured out a pretty nifty way, which is to split on both comma and equal sign, then take 2 tokens at a time.

import shlex

input_str = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

lexer = shlex.shlex(input_str)
lexer.whitespace_split = True
lexer.whitespace = ',='

ret = {}
try:
  while True:
    key = next(lexer)
    value = next(lexer)

    # Remove surrounding quotes
    if len(value) >= 2 and (value[0] == value[-1] == '"' or
                            value[0] == value[-1] == '\''):
      value = value[1:-1]

    ret[key] = value

except StopIteration:
  # Somehow do error checking to see if you ended up with an extra token.
  pass

print(ret)

Then you get:

{
  'age': '12',
  'name': 'bob',
  'hobbies': 'games,reading',
  'phrase': "I'm cool!",
}

However, this doesn't reject malformed input such as age,12=name,bob, but I'm OK with that in my use case.
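The "extra token" case mentioned in the code comment can be caught by collecting the tokens first and checking that they pair up. The following is a minimal sketch of that idea; the function name `parse_pairs` is made up for the example. It only detects an odd number of tokens, so input like age,12=name,bob still passes.

```python
import shlex

def parse_pairs(text):
    lexer = shlex.shlex(text)
    lexer.whitespace_split = True
    lexer.whitespace = ',='
    tokens = list(lexer)

    # An odd token count means a key without a value (or a stray token).
    if len(tokens) % 2:
        raise ValueError('dangling token: %r' % tokens[-1])

    result = {}
    for key, value in zip(tokens[::2], tokens[1::2]):
        # Remove matching surrounding quotes, single or double.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in '"\'':
            value = value[1:-1]
        result[key] = value
    return result

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''
parsed = parse_pairs(text)
print(parsed)
```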

EDIT: Handle both double-quotes and single-quotes.

Addison
0

Python seems to offer many ways to solve the task. Here is a more C-like implementation that processes the input character by character. It would be interesting to compare the run times of the different approaches.

text = 'age=12,name=bob,hobbies="games,reading",phrase="I\'m cool!"'
key = ""
val = ""
result = {}
parse_string = False  # True while inside a double-quoted value
parse_key = True      # True while reading a key, False while reading a value
for c in text:
    if c == '"':
        # A double quote toggles quoted mode and is not part of the value.
        parse_string = not parse_string
        continue
    if parse_string:
        val += c
        continue
    if c == ',':  # terminate entry
        result[key] = val  # add to dict
        key = ""
        val = ""
        parse_key = True
    elif c == '=' and parse_key:
        parse_key = False
    elif parse_key:
        key += c
    else:
        val += c
result[key] = val  # store the last entry
print(result)
# {'age': '12', 'name': 'bob', 'hobbies': 'games,reading', 'phrase': "I'm cool!"}

demo: http://repl.it/6oC/1
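For the run-time question, a rough comparison can be made with the standard `timeit` module. The sketch below times the POSIX `shlex` approach against a compact form of the regex approach from the other answers. The function names `with_shlex` and `with_regex` are invented for the example, and the numbers depend on the machine and Python version, so none are quoted here.

```python
import re
import shlex
import timeit

text = '''age=12,name=bob,hobbies="games,reading",phrase="I'm cool!"'''

def with_shlex(text):
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace_split = True
    lexer.whitespace = ','
    return dict(pair.split('=', 1) for pair in lexer)

# Compact, unnamed-group form of the VERBOSE regex shown in another answer.
PAIR = re.compile(r'''(\w+)=(["']?)(.*?)\2(?:$|,)''')

def with_regex(text):
    return {m.group(1): m.group(3) for m in PAIR.finditer(text)}

timings = {}
for func in (with_shlex, with_regex):
    # Total seconds for 1000 calls on the sample string.
    timings[func.__name__] = timeit.timeit(lambda: func(text), number=1000)
    print('%s: %.4f s per 1000 runs' % (func.__name__, timings[func.__name__]))
```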

Karl Adler