I'm currently using BeautifulSoup to web-scrape listings from a jobs website, and outputting the data into JSON via the site's HTML code.
I fix bugs with regex as they come along, but this particular issue has me stuck. When scraping a job listing, instead of extracting info from each container of interest, I've chosen to extract the JSON data embedded in the HTML source code (<script type="application/ld+json">). From there I convert the BeautifulSoup results into a string, clean out the HTML leftovers, then parse the string as JSON. However, I've hit a snag because text within the job listing uses quotes. Since the actual data is large, I'll just use a substitute.
example_string = '{"Category_A" : "Words typed describing stuff",
"Category_B" : "Other words speaking more irrelevant stuff",
"Category_X" : "Here is where the "PROBLEM" lies"}'
Now the above won't run in Python, but the string I've extracted from the job listing's HTML is pretty much in the above format. When it's passed into json.loads(), it raises the error: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5035
I'm not at all sure how to address this issue.
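To make the failure reproducible without the full listing, here is a minimal sketch built from the substitute string above, written on one line so that it is valid Python. The variable `bad` and the helper `try_parse` are names made up for illustration; the real string is far longer, which is why the real error points at column 5035.

```python
import json

# Hypothetical stand-in for the scraped string: the inner quotes around
# PROBLEM are NOT escaped, so the JSON string value ends early.
bad = (
    '{"Category_A": "Words typed describing stuff", '
    '"Category_B": "Other words speaking more irrelevant stuff", '
    '"Category_X": "Here is where the "PROBLEM" lies"}'
)


def try_parse(text):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc.msg


parsed, error = try_parse(bad)
print(error)
```

The parser reads "Here is where the " as a complete value, then expects a comma or closing brace and finds the P of PROBLEM instead.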
EDIT: Here's the actual code leading to the error:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import json, re
uClient = urlopen("http://www.ethiojobs.net/display-job/227974/Program-Manager---Mental-Health%2C-Child-Care-Gender-%26-Protection.html")
page_html = uClient.read()
uClient.close()
listing_soup = BeautifulSoup(page_html, "lxml")
json_script = listing_soup.find("script", {"type": "application/ld+json"}).strings
extracted_json_str = ''.join(json_script)
## Clean up the string with regex
extracted_json_str_CLEAN1 = re.sub(pattern = r"\r+|\n+|\t+|\\l+| | |amp;|\u2013|</?.{,6}>", # last is to get rid of </p> and </strong>
repl='',
string = extracted_json_str)
extracted_json_str_CLEAN2 = re.sub(pattern = r"\\u2019",
repl = r"'",
string = extracted_json_str_CLEAN1)
extracted_json_str_CLEAN3 = re.sub(pattern=r'\u25cf',
repl=r" -",
string = extracted_json_str_CLEAN2)
extracted_json_str_CLEAN4 = re.sub(pattern=r'\\',
repl="",
string = extracted_json_str_CLEAN3)
## Convert to JSON (HERE'S WHERE THE ERROR ARISES)
json_listing = json.loads(extracted_json_str_CLEAN4)
I do know what's leading to the error: within the last bullet point of Objective 4 in the job description, the author used quotes when referring to a required task of the job (i.e. "quality control"). With the way I've been extracting information from these job listings, a single instance of someone using quotes causes my whole approach to blow up. Surely there's a better way to build this script without liabilities like this, and without having to patch each breakdown with regex as it arises.
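One thing that may be worth checking is the last cleaning step, which deletes every backslash. The sketch below assumes the embedded JSON escapes inner quotes as \" (I have not confirmed this for that listing); the string `raw`, the helper `parses`, and the "description" key are stand-ins made up for illustration. Under that assumption, stripping backslashes is what turns valid JSON into the broken form, and parsing first, then cleaning the decoded value, avoids it.

```python
import json
import re

# Hypothetical stand-in for the raw <script> contents.
# ASSUMPTION: inner quotes arrive escaped as \" in the page source.
raw = r'{"description": "<p>Perform \"quality control\" checks</p>"}'

# Same pattern as the CLEAN4 step: delete every backslash.
stripped = re.sub(r"\\", "", raw)


def parses(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(raw))       # escapes intact
print(parses(stripped))  # escapes removed

# Alternative order: parse first, then strip HTML tags from the decoded value.
listing = json.loads(raw)
description = re.sub(r"</?[^>]+>", "", listing["description"])
print(description)
```

Because the cleanup runs on the decoded Python string instead of on the JSON text, quotes inside the description can no longer break the parse.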
Thanks!