1

Consider this text:

Would you like to have responses to your questions sent to you via email ?

I want to offer multiple choices for several words by marking them up like this:

Would you like [to get]|[having]|g[to have] responses to your questions sent [up to]|g[to]|[on] you via email ?

The choices are bracketed and separated by pipes.
The good choice is preceded by a g.

I would like to parse this sentence to get the text formatted like this:

Would you like __ responses to your questions sent __ you via email ?

With a list like:

[
  [
    {"to get":0},
    {"having":0},
    {"to have":1},
  ],
  [
    {"up to":0},
    {"to":1},
    {"on":0},
  ],
]

Is my markup design OK?
How can I apply a regex to the sentence to get the needed result and generate the list?

Edit: a user-oriented markup language is needed.

Pierre de LESPINAY
  • I don't mean to go on a rant here, but... XML is pretty good. It's even got 'markup language' in the name. And it's extensible. That's also in the name. If your inquiry is a thought experiment, then yeah, I guess it 'works', but... gah. It's 2011, writing JSON parsers for proprietary formats is heading in the wrong direction. XML is awesome, I don't care what anyone says. lalalala I can't hear you! – synthesizerpatel Jan 06 '12 at 12:40
  • I don't want to go on a rant either, but this looks more like someone looking for answers to a homework problem than a thought experiment. If that's the case, add a `homework` tag please. And regardless of whether it's original thought or an assignment, post your regex code, what you already tried, and what problems you encountered. – Dave Jan 06 '12 at 12:54
  • It's not a homework question. All markup tools like TinyMCE or Markdown have their own markup format, and they don't use good __old__ XML because of its verbosity. Thank you for your suggestion @synthesizerpatel, but I'm not building a JSON parser. I'm just trying to parse marked-up text to get some variables, nothing more. – Pierre de LESPINAY Jan 06 '12 at 13:50

4 Answers

3

I would add grouping braces {} around each set of choices, and output a list of dicts rather than a list of lists of dicts.

Code:

import re

s = 'Would you like {[to get]|[having]|g[to have]} responses to your questions sent {[up to]|g[to]|[on]} you via email ?'

def variants_to_dict(variants):
    dct = {}
    for is_good, s in variants:
        dct[s] = 1 if is_good == 'g' else 0
    return dct

def question_to_choices(s):
    choices_re = re.compile(r'{[^}]+}')
    variants_re = re.compile(r'''\|?(g?)
                                 \[
                                    ([^\]]+)
                                 \]
                                ''', re.VERBOSE)
    choices_list = []
    for choices in choices_re.findall(s):
        choices_list.append(variants_to_dict(variants_re.findall(choices)))

    return choices_re.sub('___', s), choices_list

question, choices = question_to_choices(s)
print(question)
print(choices)

Output:

Would you like ___ responses to your questions sent ___ you via email ?
[{'to have': 1, 'to get': 0, 'having': 0}, {'to': 1, 'up to': 0, 'on': 0}]
reclosedev
2

Rough parsing implementation using regular expressions:

import re
s = "Would you like [to get]|[having]|g[to have] responses to your questions sent [up to]|g[to]|[on] you via email ?"   # pattern string

choice_groups = re.compile(r"((?:g?\[[^\]]+\]\|?)+)")  # regex to get choice groups
choices = re.compile(r"(g?)\[([^\]]+)\]")  # regex to extract choices within each group

# now, use the regexes to parse the string:
groups = choice_groups.findall(s)
# returns: ['[to get]|[having]|g[to have]', '[up to]|g[to]|[on]']

# parse each group to extract the possible choices, along with whether each one is good
group_choices = [choices.findall(group) for group in groups]
# will contain [[('', 'to get'), ('', 'having'), ('g', 'to have')], [('', 'up to'), ('g', 'to'), ('', 'on')]]

# finally, substitute each choice group to form a template
template = choice_groups.sub('___', s)
# template is "Would you like ___ responses to your questions sent ___ you via email ?"

Parsing this to suit your format should be pretty easy now. Good luck :)
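
As a minimal sketch of that last step, the (flag, text) tuples can be turned into the list of lists of single-key dicts requested in the question. The helper name to_requested_format is hypothetical and not part of the original answer; the two regexes are the ones defined above.

```python
import re

s = ("Would you like [to get]|[having]|g[to have] responses to your "
     "questions sent [up to]|g[to]|[on] you via email ?")

choice_groups = re.compile(r"((?:g?\[[^\]]+\]\|?)+)")  # whole choice groups
choices = re.compile(r"(g?)\[([^\]]+)\]")              # single choices in a group


def to_requested_format(text):
    """Hypothetical helper: return (template, list of lists of one-key dicts)."""
    result = []
    for group in choice_groups.findall(text):
        # Each findall item is a (flag, option) tuple; flag is 'g' or ''.
        result.append([{option: 1 if flag == "g" else 0}
                       for flag, option in choices.findall(group)])
    return choice_groups.sub("___", text), result


template, parsed = to_requested_format(s)
print(template)
print(parsed)
```

Here parsed keeps the options in the order they appear in the sentence, which a single dict per group would not guarantee on older Python versions.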

pawroman
2

Let me suggest my solution too, using this markup:

Would you like {to get|having|+to have} responses to your questions sent {up to|+to|on} you via email ?

import re

def extract_choices(text):
    choices = []

    def callback(match):
        variants = match.group().strip('{}')
        choices.append(dict(
            (v.lstrip('+'), v.startswith('+'))
            for v in variants.split('|')
        ))
        return '___'

    text = re.sub('{.*?}', callback, text)

    return text, choices

Let's try it:

>>> import pprint
>>> t = 'Would you like {to get|having|+to have} responses to your questions sent {up to|+to|on} you via email?'
>>> pprint.pprint(extract_choices(t))
... ('Would you like ___ responses to your questions sent ___ you via email?',
... [{'having': False, 'to get': False, 'to have': True},
...  {'on': False, 'to': True, 'up to': False}])
Ski
  • Credit for this markup goes to @reclosedev, I took his idea about `{}` and stripped unnecessary things :) – Ski Jan 06 '12 at 14:17
1

I also think that XML is much more appropriate for this task, because there are already a lot of tools available that will make parsing much easier and less error-prone.

Anyway, if you decide to use your design, I'd do something like this:

import re

question_str = ("Would you like [to get]|[having]|g[to have] "
                "responses to your questions sent "
                "[up to]|g[to]|[on] you via email ?")

def option_to_dict(option_str):
    if option_str.startswith('g'):
        name = option_str.lstrip('g')
        value = 1
    else:
        name = option_str
        value = 0
    name = name.strip('[]')
    return {name: value}

regex = re.compile(r'g?\[[^]]+\](\|g?\[[^]]+\])*')

options = [[option_to_dict(option_str)
            for option_str in match.group(0).split('|')]
           for match in regex.finditer(question_str)]
print(options)

question = regex.sub('___', question_str)
print(question)

Example output:

[[{'to get': 0}, {'having': 0}, {'to have': 1}], [{'up to': 0}, {'to': 1}, {'on': 0}]]
Would you like ___ responses to your questions sent ___ you via email ?

Note: Regarding the design, I think it would be better to have a mark to set start/end of the whole set of options (not just one for single options).
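
A small sketch of why such a mark can help, under the assumption that two sets of options may end up separated by a literal pipe. The brace syntax and the sample strings are hypothetical; implicit_re is the pattern above with a non-capturing group.

```python
import re

# Pattern from the answer: bracketed options joined by pipes, no outer mark.
implicit_re = re.compile(r'g?\[[^]]+\](?:\|g?\[[^]]+\])*')
# Assumed alternative: each set of options is wrapped in braces.
explicit_re = re.compile(r'\{([^}]+)\}')

# Two sets of options that happen to be separated by a literal pipe.
implicit_text = "[a]|g[b]|[c]|g[d]"
explicit_text = "{[a]|g[b]}|{[c]|g[d]}"

implicit_groups = [m.group(0) for m in implicit_re.finditer(implicit_text)]
explicit_groups = explicit_re.findall(explicit_text)

print(implicit_groups)  # a single merged group with two "good" options
print(explicit_groups)  # two separate groups
```

Without the outer mark, the parser cannot tell where one set ends and the next begins in this case.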

jcollado
  • May I suggest replacing `name = option_str[2:-1]` and `name = option_str[1:-1]`, with one simple `option_str.strip('g[]')`? – D K Jan 06 '12 at 12:56
  • @DK Sure, that makes the code more readable. Thanks for your suggestion. – jcollado Jan 06 '12 at 13:02
  • @DK Finally I haven't added `g` to `strip` because that would drop characters from the option itself if it starts/ends with `g`. – jcollado Jan 06 '12 at 13:06
  • Yeah, I just realised that. Also, `if option_str[0] == 'g': name = option_str[1:]` should be `if option_str.startswith('g'): name = option_str.lstrip('g')`. As the brackets have not been stripped, stripping `'g'` can only remove the one outside the brackets. – D K Jan 06 '12 at 13:24
  • Thank you @jcollado. I don't understand the use of XML there, can you give an example please ? – Pierre de LESPINAY Jan 06 '12 at 13:54
  • @Glide What about something like this? `Would you like etc.?` – jcollado Jan 06 '12 at 14:00
  • It's a user-oriented markup language... Sorry, I thought it was obvious. – Pierre de LESPINAY Jan 06 '12 at 14:19
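
The `strip` pitfall raised in the comments on this answer can be seen in a short sketch. The option text "going" is a made-up example, chosen because it starts and ends with "g".

```python
option = "g[going]"  # hypothetical good option whose text starts and ends with "g"

# str.strip treats its argument as a set of characters, not as a prefix or
# suffix, so it keeps removing "g", "[" and "]" from both ends.
too_greedy = option.strip("g[]")

# Dropping the marker first and the brackets second keeps the text intact,
# because the brackets still shield the option when "g" is stripped.
careful = option.lstrip("g").strip("[]")

print(too_greedy)  # oin
print(careful)     # going
```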