
I have the following string in Python:

Date: 07/14/1995 Time: 11:31:50 Subject text: Something-cool

I want to build a dict() from it in the following key: [value] form:

{"Date": ["07/14/1995"], "Time": ["11:31:50"], "Subject text": ["Something-cool"]}

If I split the string on ": " I get the following. How can I get the desired result shown above?

>>> text.split(": ")
['Date', '07/14/1995 Time', '11:31:50 Subject text', 'Something-cool']
cs95
Anthony
  • Since your values are always one word, I guess you could `.split(' ')` each value (except the first and the last) in `text.split(": ")`, and then take the first result as the value and the rest as a key. – Jordan A. May 27 '18 at 03:21
  • Is something like `Time: 11:30 PM text: something` possible? With spaces in the value? – user3483203 May 27 '18 at 03:26

1 Answer

Let's use re.findall here:

>>> import re
>>> dict(re.findall(r'(?=\S|^)(.+?): (\S+)', text))
{'Date': '07/14/1995', 'Subject text': 'Something-cool', 'Time': '11:31:50'}

Or, if you need each value wrapped in a list, as in your desired output,

>>> {k : [v] for k, v in re.findall(r'(?=\S|^)(.+?): (\S+)', text)}
{
   'Date'        : ['07/14/1995'],
   'Subject text': ['Something-cool'],
   'Time'        : ['11:31:50']
}

Details

(?=   # lookahead
\S    # any non-whitespace character
|     # OR
^     # start of the string
)
(.+?) # 1st capture group - one or more characters, as few as possible, until...
:     # ...a colon
      # followed by a single literal space (the pattern uses a plain space, not \s)
(\S+) # 2nd capture group - one or more non-whitespace characters

Semantically, this regex means "get me all pairs of items that follow this pattern: something, then a colon and a space, then a run of non-whitespace characters". The lookahead at the start ensures the first group is not captured with leading whitespace. A lookbehind is not an option here, because lookbehinds in Python's re module support only fixed-width patterns.

Note: This will fail if your values have spaces in them.
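
To make that failure mode concrete, here is a minimal sketch. The sample string is made up for illustration; its last value contains a space:

```python
import re

# Hypothetical input: the last value ("Something cool") contains a space.
text = "Date: 07/14/1995 Subject text: Something cool"

# Same pattern as above: the value group (\S+) stops at the first whitespace.
result = dict(re.findall(r'(?=\S|^)(.+?): (\S+)', text))
print(result)
# {'Date': '07/14/1995', 'Subject text': 'Something'}
```

No error is raised. The value is silently cut at the first space, so it is worth checking the input before relying on this pattern.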


If you're doing this for multiple lines in a text file, let's build on this regex and use a defaultdict:

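import re
from collections import defaultdict

d = defaultdict(list)

# file is the path to your text file
with open(file) as f:
    for text in f:  # iterate over the open file handle, one line at a time
        for k, v in re.findall(r'(?=\S|^)(.+?): (\S+)', text.rstrip()):
            d[k].append(v)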

Each key then maps to a list of every value found for it across all lines.

cs95
  • @Anthony I've done my level best to explain the regex as intuitively as I could (regex is by no means intuitive), so feel free to ask me about anything that you don't understand here. – cs95 May 27 '18 at 03:27
  • I think I will have to go with the second approach. My intent is to have multiple values for a single key. So I need the values to be a list. – Anthony May 27 '18 at 03:28
  • I will have certain cases where I will have spaces in values. Is there a way to resolve that? – Anthony May 27 '18 at 03:30
  • @Anthony Alright. If you are iterating over multiple lines in a text file, I've added a defaultdict based solution that should help. – cs95 May 27 '18 at 03:30
  • @Anthony No, very difficult. Where/how do you decide if something is part of the next key or part of the previous value? Quite challenging. – cs95 May 27 '18 at 03:31
  • @Anthony the format of your string would be ambiguous if you allowed spaces in values. So it's not just hard, it would be impossible to parse it. – Olivier Melançon May 27 '18 at 03:32
  • Ok - cases where there are spaces in values are different lines and on those lines there is only one `:`. So I can write a if/else based on # of `:` in a line. – Anthony May 27 '18 at 03:32
  • @Anthony Okay, I'll leave it to you, since a solution for that would depend on your actual data :-) – cs95 May 27 '18 at 03:34
  • @Anthony just follow PEP8 and don't allow spaces in your keys, then you can have spaces in your values, you just have to restrict one or the other. – user3483203 May 27 '18 at 03:36
  • @coldspeed For the line of code you provided `{k : [v] for k, v in re.findall(r'(?=\S|^)(.+?): (\S+)', text)}` How can I add that to a `dict()` that already has values in it. i.e. instead of doing `my_dict = {k : [v] for k, v in re.findall(r'(?=\S|^)(.+?): (\S+)', text)}` I want to add the output of that line to `my_dict` which already has some keys in it. – Anthony May 29 '18 at 10:55
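
The two follow-up cases raised in the comments (adding to a dict that already has keys, and lines with a single colon whose value may contain spaces) could be handled along these lines. This is only a sketch: parse_pairs and merge_line are hypothetical helper names, the "Comment" line is invented sample data, and the single-colon rule assumes the data behaves as Anthony describes.

```python
import re

PAIR_RE = re.compile(r'(?=\S|^)(.+?): (\S+)')

def parse_pairs(line):
    """Return a list of (key, value) pairs found in one line."""
    line = line.strip()
    if line.count(":") == 1:
        # Assumption: exactly one colon means one key whose value may contain spaces.
        key, _, value = line.partition(":")
        return [(key.strip(), value.strip())]
    # Otherwise fall back to the regex, which needs values without spaces.
    return PAIR_RE.findall(line)

def merge_line(my_dict, line):
    """Append each parsed value to the list already stored under its key."""
    for key, value in parse_pairs(line):
        my_dict.setdefault(key, []).append(value)
    return my_dict

# my_dict already has a key before any line is merged in.
my_dict = {"Date": ["07/13/1995"]}
merge_line(my_dict, "Date: 07/14/1995 Time: 11:31:50 Subject text: Something-cool")
merge_line(my_dict, "Comment: value with spaces")
print(my_dict)
```

setdefault creates the list the first time a key appears and reuses the existing one otherwise, so existing values are kept. If my_dict is a defaultdict(list), my_dict[key].append(value) does the same job.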