I am trying to match listings of products in a JSON lines format with products in another file also in JSON format. This is sometimes called Record Linkage, Entity Resolution, Reference Reconciliation, or just matching.
The goal is to match product listings from a 3rd party retailer, e.g. “Nikon D90 12.3MP Digital SLR Camera (Body Only)” against a set of known products, e.g. “Nikon D90”.
Details
Data Objects
Product
{
"product_name": String // A unique id for the product
"manufacturer": String
"family": String // optional grouping of products
"model": String
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00
}
Listing
{
"title": String // description of product for sale
"manufacturer": String // who manufactures the product for sale
"currency": String // currency code, e.g. USD, CAD, GBP, etc.
"price": String // price, e.g. 19.99, 100.00
}
Result
{
"product_name": String
"listings": Array[Listing]
}
Data Contains two files: products.txt – Contains around 700 products listings.txt – Contains about 20,000 product listings
Current code (using python):
import jsonlines
import json
import re
import logging, sys
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
with jsonlines.open('products.jsonl') as products:
for prod in products:
jdump = json.dumps(prod)
jload = json.loads(jdump)
regpat = re.compile("^\s+|\s*-| |_\s*|\s+$")
prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
wordmatch = prodmatch + manumatch + modelmatch
#print (wordmatch)
#logging.debug('product first output')
with jsonlines.open('listings.jsonl') as listings:
for entry in listings:
jdump2 = json.dumps(entry)
jload2 = json.loads(jdump2)
wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
#print (wordmatch2)
#logging.debug('listing first output')
contained = [x for x in wordmatch2 if x in wordmatch]
if contained:
print(contained)
#logging.debug('contained first match')
Code above splits up the words in the product_name, model, and manufacturer in the products file and tries to match strings from the listings file but I feel like this is too slow and there must be a better way to do it. Any help is appreciated