
I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.

I fix bugs with regex as they come along, but this particular issue has me stuck. When scraping a job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded in the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into a string, clean out the HTML leftovers, then parse the string as JSON. However, I've hit a snag because text within the job listing contains quotes. Since the actual data is large, I'll just use a substitute.

example_string = '{"Category_A" : "Words typed describing stuff",
                   "Category_B" : "Other words speaking more irrelevant stuff",
                   "Category_X" : "Here is where the "PROBLEM" lies"}'

Now the above won't run in Python, but the string I have that has been extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it returns the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035

I'm not at all sure how to address this issue.

EDIT Here's the actual code leading to the error:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re

uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()

listing_soup = BeautifulSoup(page_html, "lxml")

json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings

extracted_json_str = ''.join(json_script)

## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+|  |&nbsp;|amp;|\u2013|</?.{,6}>", # last is to get rid of </p> and </strong>
                                repl='', 
                                string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",
                                repl = r"'",
                                string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
                                repl=r" -",
                                string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
                                repl="",
                                string = extracted_json_str_CLEAN3)

## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)

I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control" ). The way I've been going about extracting information from these job listings, a simple instance of someone using quotes causes my whole approach to blow up. Surely there's got to be a better way to build this script without such liabilities like this (as well as having to use regex to fix each breakdown as they arise).

Thanks!

Yacob
  • There's no safe way to remove bad quotation marks from a JSON string that works in all cases. Show us how you got to that JSON. The root of the problem should be in that code. – Klaus D. Nov 09 '19 at 03:46
  • How are you building your JSON? It sounds like you're doing it by hand instead of using a proper serializer (like `json.dumps`). A proper serializer wouldn't make this kind of mistake. – user2357112 Nov 09 '19 at 03:47
  • Although you're then calling `json.loads`, so it's not clear why JSON is involved in this process at all. As you've described it, the contents of the original web page aren't any sort of JSON, valid or invalid, and the JSON is something you're trying to build yourself for some reason. – user2357112 Nov 09 '19 at 03:49

2 Answers

# When extracting this, I think you should add a check for it.
# example:
if "\"" in extraction:
    extraction = extraction.replace("\"", "\'")
print(extraction)

This converts every " in the extraction to '. You will need to convert them one way or another. Python lets you use both kinds of quote, so to put a " inside a string you need to either swap the delimiters or escape it:

example:

"this is a 'test'"
'this was a "test"'
"this is not a \"test\""
# in case the condition is met
if "\"" in item:
    #use this
    item = item.replace("\"", "\'")
    #or use this
    item = item.replace("\"", "\\\"")
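
As a rough illustration of the second option, assuming the quoted text is available as a standalone value before it goes into the JSON (`item` is reused from above; `wrapped` is a made-up name), `json.dumps` applies the same backslash escaping automatically:

```python
import json

# A value that contains double quotes, like the one in the question.
item = 'Here is where the "PROBLEM" lies'

# json.dumps escapes the inner double quotes with a backslash.
wrapped = json.dumps({"Category_X": item})
print(wrapped)

# Parsing the serialized text gives back the original value unchanged.
restored = json.loads(wrapped)["Category_X"]
print(restored == item)
```

This only helps when you build the JSON yourself. It does not repair a string whose quotes are already unescaped.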

You need to escape a double quote (") with a backslash (\) if you want it inside a value. So the string passed to json.loads() should look like this:

example_string = '{"Category_A": "Words typed describing stuff", "Category_B": "Other words speaking more irrelevant stuff", "Category_X": "Here is where the \\"PROBLEM\\" lies"}'

json.loads can parse this.
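
A minimal sketch of why this matters for scraped data, under the assumption that the embedded JSON already escapes inner quotes as \" (not verified against the page in the question). Stripping every backslash before parsing undoes that escaping, while parsing first and cleaning the decoded values afterwards keeps the JSON valid. `raw`, `stripped_ok`, `clean_text` and `listing` are made-up names for illustration.

```python
import json
import re

# Stand-in for text taken from a <script type="application/ld+json"> tag.
# The inner quotes are escaped, as valid JSON requires.
raw = r'{"title": "Manager", "description": "<p>Do \"quality control\"&nbsp;daily</p>"}'

# Removing all backslashes before parsing breaks the escaping.
try:
    json.loads(raw.replace("\\", ""))
    stripped_ok = True
except json.JSONDecodeError:
    stripped_ok = False
print(stripped_ok)

def clean_text(value):
    """Remove HTML tags and &nbsp; from an already-decoded string."""
    value = re.sub(r"</?[^>]+>", "", value)
    return value.replace("&nbsp;", " ")

# Parse first, then clean each string value.
listing = json.loads(raw)
listing = {k: clean_text(v) if isinstance(v, str) else v for k, v in listing.items()}
print(listing["description"])
```

Cleaning after parsing means the regexes only ever touch plain text, so they cannot damage the JSON structure.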

Gouri
  • Explain/investigate how to escape the quotes automatically, as the data the OP says comes from a web page – Pynchia Nov 09 '19 at 06:52