5

I'm trying to parse a line of text that contains a sentence, optionally followed by some key/value pairs on the same line. Not only are the key/value pairs optional, they are also dynamic (the keys vary from line to line). I'm looking for a result like this:

Input:

"There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

Output:

Values = {'theSentence' : "There was a cow at home.",
          'home' : "mary",
          'cowname' : "betsy",
          'date' : "10-jan-2013"
         }

Input:

"Mike ordered a large hamburger. lastname=Smith store=burgerville"

Output:

Values = {'theSentence' : "Mike ordered a large hamburger.",
          'lastname' : "Smith",
          'store' : "burgerville"
         }

Input:

"Sam is nice."

Output:

Values = {'theSentence' : "Sam is nice."}

Thanks for any input/direction. I know the sentences make this look like a homework problem, but I'm just a Python newbie. I suspect the solution involves a regex, but I'm not very good with regexes.

Rohit Jain
tazzytazzy

9 Answers

4

I'd use re.sub:

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

d = {}

def add(m):
    d[m.group(1)] = m.group(2)

s = re.sub(r'(\w+)=(\S+)', add, s)
d['theSentence'] = s.strip()

print(d)

Here's more compact version if you prefer:

d = {}
d['theSentence'] = re.sub(r'(\w+)=(\S+)',
    lambda m: d.setdefault(m.group(1), m.group(2)) and '',
    s).strip()

Or, maybe, findall is a better option:

rx = r'(\w+)=(\S+)|(\S.+?)(?=\w+=|$)'
d = {
    a or 'theSentence': (b or c).strip()
    for a, b, c in re.findall(rx, s)
}
print(d)
georg
1

The first step is to split the string at the first period:

inputStr = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
theSentence, others = inputStr.split('.', 1)

You're going to then want to break up "others". Play around with split() (the argument you pass in tells Python what to split the string on), and see what you can do. :)
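
One way to finish this idea, as a minimal sketch: it assumes the first period ends the sentence and that no value contains whitespace. The function name `parse_line` is made up for illustration.

```python
def parse_line(line):
    """Split a line into the sentence and its trailing key=value pairs."""
    # partition() always returns three parts, so a line with no period
    # or no trailing pairs does not raise an unpacking error.
    sentence, dot, others = line.partition('.')
    values = {'theSentence': sentence + dot}
    # split() with no argument splits on runs of whitespace and drops empty strings.
    for pair in others.split():
        key, _, val = pair.partition('=')
        values[key] = val
    return values


print(parse_line("Mike ordered a large hamburger. lastname=Smith store=burgerville"))
```

Using `partition` instead of `split` here is a design choice: the "Sam is nice." case, which has nothing after the period, goes through the same code path.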

A.Wan
1

If your sentence is guaranteed to end with a period (.), you could use the following approach.

>>> inputString = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
>>> Values = {}
>>> testList = inputString.split('.')
>>> Values['theSentence'] = testList[0] + '.'

For the rest of the values, loop over the remaining pairs:

>>> for elem in testList[1].split():
...     key, val = elem.split('=')
...     Values[key] = val

This gives you a Values dict like so:

>>> Values
{'date': '10-jan-2013', 'home': 'mary', 'cowname': 'betsy', 'theSentence': 'There was a cow at home.'}
>>> Values2
{'lastname': 'Smith', 'theSentence': 'Mike ordered a large hamburger.', 'store': 'burgerville'}
>>> Values3
{'theSentence': 'Sam is nice.'}
Sukrit Kalra
1

Assuming there is only one dot, and that it separates the sentence from the assignment pairs:

text = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
# Note: raises ValueError if the line contains no ". " separator.
sentence, assignments = text.split(". ")

result = {'theSentence': sentence + "."}
for item in assignments.split():
    key, value = item.split("=")
    result[key] = value

print(result)

prints:

{'date': '10-jan-2013', 
 'home': 'mary', 
 'cowname': 'betsy', 
 'theSentence': 'There was a cow at home.'}
alecxe
  • +1 We think identical on this one, I'm not even posting mine. BTW why the `if item:` check? Looks like the for will do. – MGP Jul 22 '13 at 19:05
  • Thank you, I've removed `if item` check and switched to splitting by `. ` instead of just dot. – alecxe Jul 22 '13 at 19:07
0

This assumes that = doesn't appear in the sentence itself, which seems safer than assuming the sentence ends with a period.

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

eq_loc = s.find('=')
if eq_loc > -1:
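    # Back up to the space just before the first key.
    meta_loc = s[:eq_loc].rfind(' ')
    # Slice off the metadata before truncating s, or the pairs are lost.
    metastr = s[meta_loc + 1:]
    s = s[:meta_loc]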

    metadict = dict(m.split('=') for m in metastr.split())
else:
    metadict = {}

metadict["theSentence"] = s
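
A variation on the same idea, sketched under a looser assumption: only the trailing words need to look like key=value, so an = earlier in the sentence is tolerated. The function name `split_trailing_pairs` is hypothetical, and note that rejoining the words collapses runs of whitespace in the sentence.

```python
def split_trailing_pairs(line):
    """Peel key=value words off the end of a line; the rest is the sentence."""
    words = line.split()
    pairs = {}
    # Walk backwards from the end while the last word contains '='.
    while words and '=' in words[-1]:
        key, _, value = words.pop().partition('=')
        pairs[key] = value
    # Whatever is left is the sentence (whitespace normalized to single spaces).
    pairs['theSentence'] = ' '.join(words)
    return pairs


print(split_trailing_pairs("There was a cow at home. home=mary cowname=betsy date=10-jan-2013"))
```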
FastTurtle
0

So as usual, there's a bunch of ways to do this. Here's a regexp based approach that looks for key=value pairs:

import re

sentence = "..."

values = {}
for match in re.finditer(r"(\w+)=(\S+)", sentence):
    if not values:
        # everything to the left of the first key/value pair is the sentence
        values["theSentence"] = sentence[:match.start()].strip()
    # record every pair, including the first one
    key, value = match.groups()
    values[key] = value
if not values:
    # no key/value pairs, keep the entire sentence
    values["theSentence"] = sentence

This assumes that each key consists of word characters (letters, digits, and underscores), and that each value consists of one or more non-whitespace characters.
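
For reuse, the same finditer idea could be wrapped in a function. This is only a sketch under the assumptions just stated; the names `PAIR` and `parse` are illustrative, not part of any library.

```python
import re

# A key of word characters, an equals sign, then a run of non-whitespace.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse(sentence):
    """Return a dict of the key=value pairs plus the leading sentence."""
    values = {}
    first_start = None  # index where the first key=value pair begins
    for match in PAIR.finditer(sentence):
        if first_start is None:
            first_start = match.start()
        key, value = match.groups()
        values[key] = value
    # Slicing with None keeps the whole string when no pair was found.
    values["theSentence"] = sentence[:first_start].strip()
    return values


for line in ("Mike ordered a large hamburger. lastname=Smith store=burgerville",
             "Sam is nice."):
    print(parse(line))
```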

Fredrik
0

Supposing that the first period separates the sentence from the values, you can use something like this:

#! /usr/bin/python3

a = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"

values = (lambda s, tail: (lambda d, kv: (d, d.update (kv) ) ) ( {'theSentence': s}, {k: v for k, v in (x.split ('=') for x in tail.strip ().split (' ') ) } ) ) (*a.split ('.', 1) ) [0]

print (values)
Hyperboreus
  • lambdas are slow and a bit overkill for this methinks. – Slater Victoroff Jul 22 '13 at 19:12
  • There have been various discussion on stackoverflow comparing lambda expressions with named functions. IIRC, once compiled there is no way to tell them apart, but I am not sure though. But my point was more to show the multi-paradigm character of python. Use it procedural (like the other answers here), functional (like mine), object-oriented, whatever suits you best according to your personal preferences. – Hyperboreus Jul 22 '13 at 20:13
0

Nobody posted a comprehensible one-liner. The question is answered, but gotta do it in one line, it's the Python way!

dict({"theSentence": sentence.split(".")[0]}, **{item.split("=")[0]: item.split("=")[1] for item in sentence.split(".")[1].split()})

Eh, not super elegant, but it's totally in one line. No imports even.

Slater Victoroff
0

Use the regular expression function findall. The first capture group is the sentence. | is the "or" condition for the second alternative: one or more spaces, a key of one or more word characters, the equals sign, and a value of one or more non-space characters.

import re

s = "There was a cow at home. home=mary cowname=betsy date=10-jan-2013"
# Keeping \s+ outside the key group keeps leading spaces out of the keys.
all_matches = re.findall(r'([\w\s]+\.)|\s+(\w+)=(\S+)', s)
d = {}
for sentence, key, value in all_matches:
    if sentence != "":
        d["theSentence"] = sentence
    else:
        d[key] = value

print(d)

output:

  {'theSentence': 'There was a cow at home.', 'home': 'mary', 'cowname': 'betsy', 'date': '10-jan-2013'}
Golden Lion