I'm getting data back as a string in the response to a request made with the requests library, and I ultimately want to convert it to JSON with json.loads(). The string is quite messy, so I have to clean it before it can be loaded as a JSON object.

The string can have extra quotation marks like:

{"address":""home address 25"street",
"date":"""}

I am trying to write a regexp that removes these extra quotation marks so that the result is:

{"address":"home address 25 street",
"date":""}

My idea was to first write a regexp that matches all valid quotation marks, then match every quotation mark in the string except those, and replace the rest with an empty string ''.

[image]

Here's the regexp I tried, but it fails to detect all valid quotation marks. As shown in the image, the quotation marks above the red dots are valid and should have been detected.

Note that the last red dot has two quotation marks above it; that's the kind of issue I want to solve.
Also, ignore the blacked-out part; that's sensitive info.

  • The more important question is why the "JSON" string you are receiving is built incorrectly. Do you have access to the backend? You can invest a lot of time in cleaning that string, but the result may not be perfect, and you'd be investing more time than just fixing the backend. – Tin Nguyen Jun 30 '20 at 08:52
  • Hey Tin, I'm scraping publicly available data, so it's not me who worked on the backend – Humaid Kidwai Jun 30 '20 at 09:05
  • Now I realize there was no need to black out parts of the image since it's public anyway lol – Humaid Kidwai Jun 30 '20 at 09:07
  • Can you link the API endpoint? It may already be escaping the extra quotes as `\"`. I can't imagine they are sending you an invalid JSON string. – Tin Nguyen Jun 30 '20 at 09:24
  • Endpoint: https://ibapi.in/Sale_Info_Home.aspx/bind_modal_detail It needs a payload too. You could try this as a test: {prop_id: "SBIN00000000001"} – Humaid Kidwai Jun 30 '20 at 09:30
  • Server unreachable for me. Someone else has to try it. The network I am in may be blocking it. – Tin Nguyen Jun 30 '20 at 09:39
  • Try with a VPN, maybe? It's an Indian site – Humaid Kidwai Jun 30 '20 at 09:41

2 Answers

import re

str1 = '''
{"address":""home address 25"street",
"date":"""}

'''
# Replace every " and \n with a space
str2 = re.sub(r'["\n]', ' ', str1)

# Find all key, value pairs (this assumes keys and values contain no ',' or ':')
data = re.findall(r'([^{,:]+):([^,:}]+)', str2)

# Reconstruct a dictionary
result = {key.strip(): value.strip() for key, value in data}

print(result)
Pramote Kuacharoen

You can probably just match every string, whatever its content,
as long as it is surrounded by proper JSON structure,
and then fix the double quotes inside each match from a callback function passed to re.sub.

The regex to match a pseudo-valid JSON string is:

r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])'

see https://regex101.com/r/vqn6e0/1
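
To see what that pattern captures, here is a small sketch that lists the raw string bodies it finds in the sample from the question. The names PATTERN and bodies are illustrative only; group 2 still contains the stray quotes at this stage.

```python
import re

# Pattern from above: group 1 is the structural prefix ({ [ , or :),
# group 2 is the raw string body, stray quotes included.
PATTERN = re.compile(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])')

text = '{"address":""home address 25"street",\n"date":"""}'

bodies = [m.group(2) for m in PATTERN.finditer(text)]
for body in bodies:
    print(repr(body))
```

Each printed body is what the callback receives as its second group, so any quote that remains inside it is one that needs to be replaced or removed.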

Within the callback, use two regexes to replace the quotes:

  • The first matches a quote that has a non-quote character on both sides,
    r'(?<=[^"])"(?=[^"])', and replaces it with a space.
  • The second replaces all remaining quotes with the empty string.

Python sample:

>>> import re
>>>
>>> text = '''
... {"address":""home address 25"street",
... "date":"""}
... '''
>>>
>>> def repl_call(m):
...     preq = m.group(1)
...     qbody = m.group(2)
...     qbody = re.sub( r'(?<=[^"])"(?=[^"])', ' ', qbody )
...     qbody = re.sub( r'"', '', qbody )
...     return preq + '"' + qbody + '"'
...
>>> print( re.sub( r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_call, text ))

{"address":"home address 25 street",
"date":""}
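
To get from the cleaned text to the Python object the question asks for, the same idea can be wrapped in a function and passed to json.loads(). This is a minimal sketch, checked only against the sample from the question; clean_quotes and STRING_RE are hypothetical names, not part of any library.

```python
import json
import re

# Same pattern as above, compiled once.
STRING_RE = re.compile(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])')


def clean_quotes(raw):
    """Remove stray double quotes inside JSON string bodies (hypothetical helper)."""
    def repl(m):
        body = m.group(2)
        # A quote with a non-quote character on both sides becomes a space.
        body = re.sub(r'(?<=[^"])"(?=[^"])', ' ', body)
        # Any quotes still left in the body are dropped.
        body = body.replace('"', '')
        return m.group(1) + '"' + body + '"'
    return STRING_RE.sub(repl, raw)


raw = '{"address":""home address 25"street",\n"date":"""}'
data = json.loads(clean_quotes(raw))
print(data)
```

Note that a value containing a stray quote directly followed by a comma, colon, ] or } would be cut at that point, because the pattern treats that quote as the end of the string.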