Python tokenize sentence with optional key/val pairs

Question

I'm trying to parse a line of text that contains a sentence, optionally followed by some key/val pairs on the same line. Not only are the key/value pairs optional, they are dynamic: the keys vary from line to line. I'd like the result to look something like this:

Input:

"There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

Output:

Values = {'theSentence' : "There was a cow at home.",
          'home' : "mary",
          'cowname' : "betsy",
          'date' : "10-jan-2013"
         }

Input:

"Mike ordered a large hamburger. lastname=Smith store=burgerville"

Output:

Values = {'theSentence' : "Mike ordered a large hamburger.",
          'lastname' : "Smith",
          'store' : "burgerville"
         }

Input:

"Sam is nice."

Output:

Values = {'theSentence' : "Sam is nice."}

Thanks for any input/direction. I know the sentences make this look like a homework problem, but I'm just a Python newbie. I suspect the solution involves a regex, but regex is not my strong suit.

Can you assume that `=` will not appear in the sentence itself? — FastTurtle, Jul 22 '13 at 18:53
is there a compelling reason the variables follow one form and the sentence does not? ie "thesentence=some sentence you want to see". Ideally you'd have some delimiter here. — Brad, Jul 22 '13 at 18:53

georg · Accepted Answer · 2013-07-22T19:16:03.137

4

I'd use re.sub:

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

d = {}

def add(m):
    d[m.group(1)] = m.group(2)

s = re.sub(r'(\w+)=(\S+)', add, s)
d['theSentence'] = s.strip()

print(d)

Here's more compact version if you prefer:

d = {}
d['theSentence'] = re.sub(r'(\w+)=(\S+)',
    lambda m: d.setdefault(m.group(1), m.group(2)) and '',
    s).strip()

Or, maybe, findall is a better option:

rx = r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)'
d = {
    a or 'theSentence': (b or c).strip()
    for a, b, c in re.findall(rx, s)
}
print(d)
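For anyone who wants to reuse this, here is a minimal sketch that wraps the re.sub idea in a function and runs it on the three inputs from the question. The name `parse_line` is made up for this example, and it assumes keys consist of word characters and values contain no whitespace.

```python
import re

PAIR = re.compile(r'(\w+)=(\S+)')

def parse_line(line):
    """Split a line into its sentence and trailing key=value pairs."""
    values = {}

    def collect(match):
        # Record the pair, then remove it from the text by returning ''.
        values[match.group(1)] = match.group(2)
        return ''

    # Whatever is left after removing the pairs is the sentence.
    values['theSentence'] = PAIR.sub(collect, line).strip()
    return values

for line in [
    "There was a cow at home. home=mary cowname=betsy date=10-jan-2013",
    "Mike ordered a large hamburger. lastname=Smith store=burgerville",
    "Sam is nice.",
]:
    print(parse_line(line))
```

Returning a fresh dict from each call avoids sharing one module-level `d` between lines.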

edited Jul 22 '13 at 19:16

answered Jul 22 '13 at 19:04

georg

cmooon, make it a one-liner. You know you want to – Slater Victoroff Jul 22 '13 at 19:07
@SlaterTyranus The Zen says: Sparse is better than dense. – MGP Jul 22 '13 at 19:11
Thanks for the quick response! It's very much appreciated. This solution works both with and without periods, so it's great. – tazzytazzy Jul 22 '13 at 20:05

A.Wan · Answer 2 · 2013-07-22T19:02:22.173

1

The first step is to do

inputStr = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
theSentence, others = inputStr.split('.', 1)  # split once, at the first period

You're going to then want to break up "others". Play around with split() (the argument you pass in tells Python what to split the string on), and see what you can do. :)
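One way the remaining step could look, as a rough sketch: split `others` on whitespace, then split each piece on `=`. The helper name `parse_with_split` is hypothetical, and this assumes the first period in the line is the one that ends the sentence.

```python
def parse_with_split(line):
    # Cut the line once at the first period; the tail holds the key=value pairs.
    the_sentence, _, others = line.partition('.')
    values = {'theSentence': the_sentence + '.'}
    for piece in others.split():           # no argument: split on any whitespace
        key, value = piece.split('=', 1)   # split each piece at its first '='
        values[key] = value
    return values

print(parse_with_split("There was a cow at home. home=mary cowname=betsy date=10-jan-2013"))
print(parse_with_split("Sam is nice."))
```

With no pairs after the period, `others` is empty, the loop does nothing, and only `theSentence` is returned.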

edited Jul 22 '13 at 19:02

answered Jul 22 '13 at 18:53

A.Wan

Don't name variables `str`, native datatype!! – MGP Jul 22 '13 at 19:01
@ManuelGutierrez thanks! Wow that's a bad habit I accidentally developed, always assumed it was `string` and so `str` was safe... – A.Wan Jul 22 '13 at 19:02
This doesn't... answer the question at all. Why does this have more upvotes than answers that are actually answers? – Slater Victoroff Jul 22 '13 at 19:08

score 1 · Answer 3 · answered Jul 22 '13 at 18:58

If your sentence is guaranteed to end with a period, you could take the following approach.

>>> inputString = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
>>> Values = {}
>>> testList = inputString.split('.')
>>> Values['theSentence'] = testList[0]+'.'

For the rest of the values, just loop over the remaining tokens.

>>> for elem in testList[1].split():
...     key, val = elem.split('=')
...     Values[key] = val
...

Giving you a Values like so

>>> Values
{'date': '10-jan-2013', 'home': 'mary', 'cowname': 'betsy', 'theSentence': 'There was a cow at home.'}
>>> Values2
{'lastname': 'Smith', 'theSentence': 'Mike ordered a large hamburger.', 'store': 'burgerville'}
>>> Values3
{'theSentence': 'Sam is nice.'}

alecxe · Answer 4 · 2013-07-22T19:06:54.717

1

Assuming a single ". " separates the sentence from the assignment pairs:

text = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
# partition() always returns three parts, so a line without pairs does not raise
sentence, _, assignments = text.partition(". ")

result = {'theSentence': sentence.rstrip(".") + "."}
for item in assignments.split():
    key, value = item.split("=")
    result[key] = value

print(result)

prints:

{'date': '10-jan-2013', 
 'home': 'mary', 
 'cowname': 'betsy', 
 'theSentence': 'There was a cow at home.'}

edited Jul 22 '13 at 19:06

answered Jul 22 '13 at 18:58

alecxe

+1 We think identical on this one, I'm not even posting mine. BTW why the `if item:` check? Looks like the for will do. – MGP Jul 22 '13 at 19:05
Thank you, I've removed `if item` check and switched to splitting by `. ` instead of just dot. – alecxe Jul 22 '13 at 19:07

score 0 · Answer 5 · answered Jul 22 '13 at 19:00

This assumes that = doesn't appear in the sentence itself, which seems a safer assumption than the sentence ending with a period.

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

eq_loc = s.find('=')
if eq_loc > -1:
    meta_loc = s[:eq_loc].rfind(' ')
    # take the pairs first, then truncate s to the sentence
    metastr = s[meta_loc + 1:]
    s = s[:meta_loc]

    metadict = dict(m.split('=') for m in metastr.split())
else:
    metadict = {}

metadict["theSentence"] = s
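The same idea can be sketched as a reusable function: everything from the last space before the first `=` onward is treated as pairs. The name `split_at_first_pair` is invented for this example, and it assumes every token after the sentence has the form key=value.

```python
def split_at_first_pair(line):
    """Return a dict with the sentence and any trailing key=value pairs."""
    eq_loc = line.find('=')
    if eq_loc == -1:
        # No '=' anywhere: the whole line is the sentence.
        return {'theSentence': line.strip()}
    # The pairs start right after the last space before the first '='.
    # rfind() returns -1 when there is no such space, so start becomes 0.
    start = line.rfind(' ', 0, eq_loc) + 1
    result = dict(pair.split('=', 1) for pair in line[start:].split())
    result['theSentence'] = line[:start].strip()
    return result

print(split_at_first_pair("Mike ordered a large hamburger. lastname=Smith store=burgerville"))
print(split_at_first_pair("Sam is nice."))
```

Because it never looks for a period, this version also works on sentences that end with other punctuation or none at all.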

score 0 · Answer 6 · answered Jul 22 '13 at 19:01

So as usual, there's a bunch of ways to do this. Here's a regexp based approach that looks for key=value pairs:

import re

sentence = "..."

values = {}
for match in re.finditer(r"(\w+)=(\S+)", sentence):
    if not values:
        # everything to the left of the first key/value pair is the sentence
        values["theSentence"] = sentence[:match.start()].strip()
    # store every pair, including the first one
    key, value = match.groups()
    values[key] = value
if not values:
    # no key/value pairs, keep the entire sentence
    values["theSentence"] = sentence

This assumes that each key consists of word characters (letters, digits, underscores), and that each value consists of one or more non-whitespace characters.
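To make those assumptions concrete, here is a small sketch of what the pattern does and does not pick up. The sample strings are invented for illustration, and example.com is only a placeholder.

```python
import re

pair = re.compile(r"(\w+)=(\S+)")

# A key stops at the first non-word character, so only "c" is captured from "b-c".
# A value stops at the first whitespace, so the trailing "y" is dropped.
print(pair.findall("a=1 b-c=2 d=x y"))   # [('a', '1'), ('c', '2'), ('d', 'x')]

# \w also matches digits, so a key may start with one.
print(pair.findall("2x=3"))              # [('2x', '3')]

# \S matches any non-space character, so a value may itself contain '='.
print(pair.findall("url=http://example.com/?q=1"))
```

If values can contain spaces, this pattern is not enough and the input format would need quoting or another delimiter.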

score 0 · Answer 7 · answered Jul 22 '13 at 19:04

Supposing that the first period separates the sentence from the values, you can use something like this:

#! /usr/bin/python3

a = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

values = (lambda s, tail: (lambda d, kv: (d, d.update (kv) ) ) ( {'theSentence': s + '.'}, {k: v for k, v in (x.split ('=') for x in tail.split () ) } ) ) (*a.split ('.', 1) ) [0]

print (values)

answered Jul 22 '13 at 19:04

Hyperboreus

lambdas are slow and a bit overkill for this methinks. – Slater Victoroff Jul 22 '13 at 19:12
There have been various discussions on stackoverflow comparing lambda expressions with named functions. IIRC, once compiled there is no way to tell them apart, but I am not sure though. But my point was more to show the multi-paradigm character of python. Use it procedural (like the other answers here), functional (like mine), object-oriented, whatever suits you best according to your personal preferences. – Hyperboreus Jul 22 '13 at 20:13

score 0 · Answer 8 · answered Jul 22 '13 at 19:12

Nobody posted a comprehensible one-liner. The question is answered, but gotta do it in one line, it's the Python way!

dict([("theSentence", sentence.split(".", 1)[0] + ".")] + [item.split("=", 1) for item in sentence.split(".", 1)[1].split()])

Eh, not super elegant, but it's totally in one line. No imports even.

answered Jul 22 '13 at 19:12

Slater Victoroff

In my opinion, that's the exact opposite of the Python way. – netcoder Jul 22 '13 at 19:17
If I wanted to get headaches while coding, I'd use Perl. :P – netcoder Jul 22 '13 at 19:22

score 0 · Answer 9 · answered Jun 15 '21 at 16:39

Use the regular expression function findall. The first capture group is the sentence. | is the or condition for the key/value alternative: one or more word characters, the equal sign, and one or more non-space characters.

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
all_matches = re.findall(r'([\w\s]+\.)|(\w+)=(\S+)', s)
d = {}
# each match is a (sentence, key, value) tuple; unused groups are empty strings
for sentence, key, value in all_matches:
    if sentence:
        d["theSentence"] = sentence
    else:
        d[key] = value

print(d)

output:

  {'theSentence': 'There was a cow at home.', 'home': 'mary', 'cowname': 'betsy', 'date': '10-jan-2013'}
