1

Let's say I have a string that looks like so:

text = '''
{"question":"In 2017, what was the approximate number of clinics in the US that provided abortion services?","category":"RFB","answers":["80","800","8000","80000"],"sources":["https://www.guttmacher.org/fact-sheet/induced-abortion-united-states"]} 
{"question":"Compared to actively religious US adults, how many unaffiliated US adults were active in non-religious voluntary organizations, such as charities?","category":"DFB","answers":["Slightly fewer (10% difference)","Slightly more (10% difference)","Many fewer (35% difference)","Many more (35% difference)"],"sources":["https://www.pewforum.org/2019/01/31/religions-relationship-to-happiness-civic-engagement-and-health-around-the-world/"]}
{"question":"In the US in 2015, there were ___ abortions per 1000 live births.","category":"DFB","answers":["12","80","124","188"],"sources":["https://www.cdc.gov/mmwr/volumes/67/ss/ss6713a1.htm?s_cid=ss6713a1_w"]}'''

I would like to convert this string into Python dictionaries with the keys "question", "category", "answers", and "sources". Question and category will always be plain text, whereas answers and sources will be in a list-like format with brackets.

I assume this will require a regex, as in this answer, with something of the form `dictionary = dict(re.findall(r"\{(\S+)\s+\{*(.*?)\}+",text))`, but I can't quite get it to match all the keys I need.

Any thoughts?

The identified "duplicate" link doesn't solve my problem. I get an "invalid syntax" error when using `dictionary = ast.literal_eval(text)`, because the string holds several dictionaries and I haven't managed to separate them from one another.

Parseltongue
  • Any reason you couldn't use `json.loads()` on those lines individually? – mayosten Oct 16 '19 at 17:48
  • How would I go about doing that? There aren't neatly demarcated lines in the actual string. It's just a hodgepodge of consecutive dictionaries enclosed by braces – Parseltongue Oct 16 '19 at 17:53
  • @Parseltongue why don't you just fix the source of this and use a real, well supported serialization format instead of just dumping a bunch of string representations of dictionaries to a file? – juanpa.arrivillaga Oct 16 '19 at 17:55
  • I don't have any control over the source. They were given to me like this. They are the output of a survey software that dumps string representations of dictionary like this for some reason – Parseltongue Oct 16 '19 at 17:55
  • So, there is no reliable delimiter? Not even a newline? then you are going to have to write some sort of bespoke parser – juanpa.arrivillaga Oct 16 '19 at 17:59
  • could your `text` be given as a single line? – RomanPerekhrest Oct 16 '19 at 18:02
  • @juanpa.arrivillaga The only reliable delimiter is the fact that every dictionary begins and ends with a brace. As far as I can tell, the dictionaries appear back-to-back such that the end of one dictionary is immediately followed by the beginning of a new dictionary. – Parseltongue Oct 16 '19 at 18:04
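
For the back-to-back case described in the comments above, one possible sketch uses `json.JSONDecoder.raw_decode`, which parses a single JSON value starting at a given index and returns the index where it stopped, so no delimiter between the dictionaries is needed. The helper name `iter_json_objects` and the `sample` string are illustrative assumptions, not part of the original question.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value found back-to-back in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # Skip any whitespace (including newlines) between objects.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Made-up sample: two dictionaries with no separator, then one after a newline.
sample = '{"question":"a","answers":["1","2"]}{"question":"b","answers":["3"]}\n{"question":"c","answers":[]}'
records = list(iter_json_objects(sample))
print(len(records))
print(records[1]["question"])
```

Because the JSON parser itself decides where each object ends, a brace inside a string value does not confuse it, which is the main weakness of splitting on "}" or on newlines.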

5 Answers

1

You can try this; I hope it's helpful. It returns a list here, but you can adapt that to your purpose.

import ast

# Parse each line of the text as a Python literal (one dictionary per line).
a = text.strip().split("\n")
b = []
for i in a:
    d = ast.literal_eval(i.strip())
    b.append(d)
>>> b
[{'question': 'In 2017, what was the approximate number of clinics in the US that provided abortion services?', 'category': 'RFB', 'answers': ['80', '800', '8000', '80000'], 'sources': ['https://www.guttmacher.org/fact-sheet/induced-abortion-united-states']}, {'question': 'Compared to actively religious US adults, how many unaffiliated US adults were active in non-religious voluntary organizations, such as charities?', 'category': 'DFB', 'answers': ['Slightly fewer (10% difference)', 'Slightly more (10% difference)', 'Many fewer (35% difference)', 'Many more (35% difference)'], 'sources': ['https://www.pewforum.org/2019/01/31/religions-relationship-to-happiness-civic-engagement-and-health-around-the-world/']}, {'question': 'In the US in 2015, there were ___ abortions per 1000 live births.', 'category': 'DFB', 'answers': ['12', '80', '124', '188'], 'sources': ['https://www.cdc.gov/mmwr/volumes/67/ss/ss6713a1.htm?s_cid=ss6713a1_w']}]

Lê Tư Thành
1

This works.

The output of the code below is:

{'question':'abc', 'category':'abc', 'answers':['a', 'b', 'c'], 'sources': ['a', 'b', 'c']}
import json
text = '''{"question":"In 2017, what was the approximate number of clinics in the US that provided abortion services?","category":"RFB","answers":["80","800","8000","80000"],"sources":["https://www.guttmacher.org/fact-sheet/induced-abortion-united-states"]}
{"question":"Compared to actively religious US adults, how many unaffiliated US adults were active in non-religious voluntary organizations, such as charities?","category":"DFB","answers":["Slightly fewer (10 difference)","Slightly more (10 difference)","Many fewer (35 difference)","Many more (35 difference)"],"sources":["https://www.pewforum.org/2019/01/31/religions-relationship-to-happiness-civic-engagement-and-health-around-the-world/"]}
{"question":"In the US in 2015, there were ___ abortions per 1000 live births.","category":"DFB","answers":["12","80","124","188"],"sources":["https://www.cdc.gov/mmwr/volumes/67/ss/ss6713a1.htm?s_cid=ss6713a1_w"]}'''

# Shortened sample used for the output shown above (this reassignment replaces the full text).
text = '''{"question":"a","category":"a","answers":["a", "b"],"sources":["a"]}
{"question":"b","category":"b","answers":["b", "c"],"sources":["b"]}
{"question":"c","category":"c","answers":["c", "d"],"sources":["c"]}'''
# Merge all records into a single dictionary.
outputDict = {"question":"", "category":"", "answers":[], "sources":[]}
for i in text.split('\n'):
    a = json.loads(i)
    # Strings are concatenated; only the first answer and first source of each record are kept.
    outputDict["question"] += a["question"]
    outputDict["category"] += a["category"]
    outputDict["answers"].append(a["answers"][0])
    outputDict["sources"].append(a["sources"][0])

print(outputDict)
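
If the goal is one dictionary keyed by field, a possible variant keeps every record's value in a list rather than concatenating the strings and taking only the first answer. This is only a sketch, assuming one JSON object per non-blank line as in the shortened sample; the name `columns` is illustrative.

```python
import json

text = '''{"question":"a","category":"a","answers":["a", "b"],"sources":["a"]}
{"question":"b","category":"b","answers":["b", "c"],"sources":["b"]}
{"question":"c","category":"c","answers":["c", "d"],"sources":["c"]}'''

# One list per key; each record contributes one entry to every list.
columns = {"question": [], "category": [], "answers": [], "sources": []}
for line in text.splitlines():
    if not line.strip():
        continue  # ignore blank lines
    record = json.loads(line)
    for key in columns:
        columns[key].append(record[key])

print(columns["question"])    # ['a', 'b', 'c']
print(columns["answers"][0])  # ['a', 'b']
```

Entry `i` of each list then belongs to record `i`, so nothing from the original records is lost.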

Waqar Bin Kalim
1

If you can absolutely guarantee that the source for your data is safe, then it can be as simple as:

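exec(f"l={text.strip()}")  # strip() so the statement does not begin with a blank line
# Note: only the first dictionary is assigned to l; the following lines are evaluated and discarded.
print(l) #{'question': 'In 2017, what was the approximate number of clinics in the US that provided abortion services?', 'category': 'RFB', 'answers': ['80', '800', '8000', '80000'], 'sources': ['https://www.guttmacher.org/fact-sheet/induced-abortion-united-states']}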

If there is even a shadow of a chance that a malicious actor could get at your input text then don't do this, but it is as simple as it gets.
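
Where the input cannot be fully trusted, a safer sketch of the same one-step idea is to turn the text into a single JSON array and hand it to `json.loads`, which only parses data and never executes it. This assumes each dictionary sits on its own line, as in the question's sample; the `sample` string here is made up for illustration.

```python
import json

sample = '''
{"question":"a","answers":["1","2"]} 
{"question":"b","answers":["3","4"]}'''

# Join the non-blank lines with commas and wrap them in brackets
# so the whole text becomes one JSON array.
lines = [line.strip() for line in sample.splitlines() if line.strip()]
records = json.loads("[" + ",".join(lines) + "]")

print(records[0]["question"])
print(len(records))
```

Unlike the `exec` version, this yields every dictionary in the text as a list, not just the first one.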

Turksarama
0

Try the code below; I hope it helps:

text = '''
{"question":"In 2017, what was the approximate number of clinics in the US that provided abortion services?","category":"RFB","answers":["80","800","8000","80000"],"sources":["https://www.guttmacher.org/fact-sheet/induced-abortion-united-states"]} 
{"question":"Compared to actively religious US adults, how many unaffiliated US adults were active in non-religious voluntary organizations, such as charities?","category":"DFB","answers":["Slightly fewer (10% difference)","Slightly more (10% difference)","Many fewer (35% difference)","Many more (35% difference)"],"sources":["https://www.pewforum.org/2019/01/31/religions-relationship-to-happiness-civic-engagement-and-health-around-the-world/"]}
{"question":"In the US in 2015, there were ___ abortions per 1000 live births.","category":"DFB","answers":["12","80","124","188"],"sources":["https://www.cdc.gov/mmwr/volumes/67/ss/ss6713a1.htm?s_cid=ss6713a1_w"]}'''

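import json
data = []
# Split on the closing brace; this assumes no "}" occurs inside any string value.
pieces = text.split("}")
# The last piece is whatever follows the final "}", so it is skipped.
for i in range(len(pieces)-1):
    data.append(json.loads(pieces[i]+'}'))

print(data)
print(type(data[0]))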

Output will be:

[{'question': 'In 2017, what was the approximate number of clinics in the US that provided abortion services?', 'category': 'RFB', 'answers': ['80', '800', '8000', '80000'], 'sources': ['https://www.guttmacher.org/fact-sheet/induced-abortion-united-states']}, {'question': 'Compared to actively religious US adults, how many unaffiliated US adults were active in non-religious voluntary organizations, such as charities?', 'category': 'DFB', 'answers': ['Slightly fewer (10% difference)', 'Slightly more (10% difference)', 'Many fewer (35% difference)', 'Many more (35% difference)'], 'sources': ['https://www.pewforum.org/2019/01/31/religions-relationship-to-happiness-civic-engagement-and-health-around-the-world/']}, {'question': 'In the US in 2015, there were ___ abortions per 1000 live births.', 'category': 'DFB', 'answers': ['12', '80', '124', '188'], 'sources': ['https://www.cdc.gov/mmwr/volumes/67/ss/ss6713a1.htm?s_cid=ss6713a1_w']}]
<class 'dict'>
Shishir Naresh
0

I was able to generate my own answer like so:

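import ast

parsed = []
for line in text.split("\n"):
    if not line.strip():
        continue  # skip blank lines, such as the one at the start of text
    try:
        # ast.literal_eval is used here in place of eval, since it only accepts literals
        parsed.append(ast.literal_eval(line.strip()))
    except (ValueError, SyntaxError):
        print("Failed")
        print(line)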
Parseltongue