
I am trying to match listings of products in a JSON lines format with products in another file also in JSON format. This is sometimes called Record Linkage, Entity Resolution, Reference Reconciliation, or just matching.

The goal is to match product listings from a 3rd party retailer, e.g. “Nikon D90 12.3MP Digital SLR Camera (Body Only)” against a set of known products, e.g. “Nikon D90”.

Details

Data Objects

Product

{
"product_name": String // A unique id for the product
"manufacturer": String
"family": String // optional grouping of products
"model": String
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00
}

Listing

{
"title": String // description of product for sale
"manufacturer": String // who manufactures the product for sale
"currency": String // currency code, e.g. USD, CAD, GBP, etc.
"price": String // price, e.g. 19.99, 100.00
}

Result

{
"product_name": String
"listings": Array[Listing]
}

Data

Contains two files:

products.txt – Contains around 700 products
listings.txt – Contains about 20,000 product listings

Current code (using python):

import jsonlines
import json
import re
import logging, sys

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

with jsonlines.open('products.jsonl') as products:
  for prod in products:
    jdump = json.dumps(prod)
    jload = json.loads(jdump)
    regpat = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
    prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
    manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
    modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
    wordmatch = prodmatch + manumatch + modelmatch
    #print (wordmatch)
    #logging.debug('product first output')
    with jsonlines.open('listings.jsonl') as listings:
      for entry in listings:
        jdump2 = json.dumps(entry)
        jload2 = json.loads(jdump2)
        wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
        #print (wordmatch2)
        #logging.debug('listing first output')
        contained = [x for x in wordmatch2 if x in wordmatch]
        if contained:
          print(contained)
        #logging.debug('contained first match')

The code above splits the product_name, model, and manufacturer of each product into words and tries to match them against the words of each listing title. It feels far too slow, and there must be a better way to do it. Any help is appreciated.

  • What's working, what isn't? If you want an answer, you have to ask a question. – Chris Johnson Oct 07 '17 at 21:58
  • The nested for loops go through all the data but my matches aren't very accurate or precise for that matter. It also takes too long to parse through – Joseph Joestar Oct 07 '17 at 22:04
  • You might want to find a database with full text search and use that. There are also online resources about text normalization which can improve this code or your use of a full text search database. I know this is open ended but it's a big field, pick a corner and start reading. :) – ldrg Oct 07 '17 at 22:12

1 Answer

First, I'm not sure what's going on with the dumps() followed by the loads(). jsonlines already hands you parsed objects, so the round trip looks totally redundant from the code you've posted here. If you can avoid serializing and deserializing everything on each iteration, that'll be a big win.
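
As a rough sketch of that first point: the snippet below tokenizes a product dict directly, with the splitter regex compiled once outside any loop. The helper names tokens and product_tokens and the sample line are made up for illustration, and plain json.loads stands in for the jsonlines reader so the snippet runs without extra dependencies.

```python
import json
import re

# Same pattern as in the question, compiled once instead of per product.
SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokens(text):
    """Lower-case the text and split it into non-empty tokens."""
    return [t for t in SPLITTER.split(text.lower()) if t]


def product_tokens(prod):
    # prod is already a dict, so no dumps()/loads() round trip is needed.
    return (
        tokens(prod["product_name"])
        + tokens(prod["manufacturer"])
        + tokens(prod["model"])
    )


# Stand-in for one line of the products file (made-up sample data).
line = '{"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}'
print(product_tokens(json.loads(line)))  # ['nikon', 'd90', 'nikon', 'd90']
```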

Second, the listings stuff: as it doesn't ever change, why not parse it once before the loop into some data structure (possibly a dict mapping the contents of wordmatch2 to the listing it was derived from) and reuse that structure while parsing products.jsonl?
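
One way that precomputed structure could look is an inverted index from title token to the listings containing it, sketched below. build_index and candidates are hypothetical helpers, and requiring every model token to appear in the title is an assumption made here to cut down on false positives, not something the question specifies.

```python
import re
from collections import defaultdict

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokens(text):
    """Lower-case the text and split it into non-empty tokens."""
    return [t for t in SPLITTER.split(text.lower()) if t]


def build_index(listings):
    """Map each title token to the positions of the listings containing it."""
    index = defaultdict(set)
    for pos, listing in enumerate(listings):
        for tok in tokens(listing["title"]):
            index[tok].add(pos)
    return index


def candidates(product, index):
    """Positions of listings whose title contains every model token."""
    # Assumption: the model is the most discriminating field, so all of
    # its tokens must be present in the title.
    sets = [index.get(tok, set()) for tok in tokens(product["model"])]
    return set.intersection(*sets) if sets else set()


# Made-up sample data in the shape of the Listing and Product objects.
listings = [
    {"title": "Doggulator 5002"},
    {"title": "Doggulator 5005"},
    {"title": "Woofer"},
]
index = build_index(listings)  # built once, reused for every product
product = {
    "product_name": "Puppersoft Doggulator 5002",
    "manufacturer": "Puppersoft",
    "model": "5002",
}
matches = [listings[i]["title"] for i in sorted(candidates(product, index))]
print(matches)
```

Each product then costs a few dict lookups instead of a full pass over all 20,000 listings.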

Next: if there's a way to rejigger this to use multiprocessing I highly suggest you do so. You're entirely bound on CPU here and you can easily get this to run in parallel on all of your cores.
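
A minimal sketch of the multiprocessing idea, assuming the work is split per product and each worker tokenizes the listing titles once at start-up. init_worker and match_product are made-up names, and the matching rule (any shared token with the model) is only a placeholder for whatever rule you settle on.

```python
import re
from multiprocessing import Pool

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokens(text):
    """Lower-case the text and return its set of non-empty tokens."""
    return {t for t in SPLITTER.split(text.lower()) if t}


_LISTING_TOKENS = []  # filled once per worker process by init_worker()


def init_worker(listing_titles):
    """Tokenize all listing titles once in each worker process."""
    global _LISTING_TOKENS
    _LISTING_TOKENS = [(title, tokens(title)) for title in listing_titles]


def match_product(product):
    """Return (product_name, titles sharing a token with the model)."""
    wanted = tokens(product["model"])
    hits = [title for title, toks in _LISTING_TOKENS if wanted & toks]
    return product["product_name"], hits


if __name__ == "__main__":
    # Made-up sample data; the real input would come from the two files.
    titles = ["Doggulator 5002", "Doggulator 5005", "Woofer"]
    products = [
        {"product_name": "Puppersoft Doggulator 5002", "model": "5002"},
        {"product_name": "Puppersoft Doggulator 5001", "model": "5001"},
    ]
    with Pool(processes=2, initializer=init_worker, initargs=(titles,)) as pool:
        for name, hits in pool.map(match_product, products):
            print(name, hits)
```

The initializer keeps the listing data from being pickled and sent along with every task; only the small product dicts travel between processes.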

Finally, I gave it a shot with some fancy regex shenanigans. The goal here is to push as much logic into the regex as I could under the thinking that re is implemented in C and thus will be more performant than doing all this string work in Python.

import json
import re

PRODUCTS = """
[
{
"product_name": "Puppersoft Doggulator 5000",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5000",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5001",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5001",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5002",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5002",
"announced-date": "ymd"
}
]
"""


LISTINGS = """
[
{
"title": "Doggulator 5002",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Doggulator 5005",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Woofer",
"manufacturer": "Shibasoft",
"currency": "Pupper Bux",
"price": "420"
}
]
"""

SPLITTER_REGEX = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
product_re_map = {}
product_re_parts = []

# get our matching keywords from products.json
for idx, product in enumerate(json.loads(PRODUCTS)):
    matching_parts = [x for x in SPLITTER_REGEX.split(product["product_name"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["manufacturer"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["model"]) if x]

    # store the product object for outputting later if we get a match
    group_name = 'i{idx}'.format(idx=idx)
    product_re_map[group_name] = product
    # create a giganto-regex that matches anything from a given product.
    # the group name is a reference back to the matching product.
    # I use set() here to deduplicate repeated words in matching_parts.
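    # re.escape() keeps regex metacharacters in product words (e.g. "+") literal.
    product_re_parts.append("(?P<{group_name}>{words})".format(group_name=group_name, words="|".join(re.escape(w) for w in set(matching_parts))))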
# Do the case-insensitive matching in C code
product_re = re.compile("|".join(product_re_parts), re.I)

for listing in json.loads(LISTINGS):
    # we match against split words in the regex created above so we need to
    # split our source input in the same way
    matching_listings = []
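    for word in SPLITTER_REGEX.split(listing['title']):
        if word:
            product_match = product_re.match(word)
            if product_match:
                # groupdict() lists every named group, with None for groups
                # that did not take part in the match, so skip those.
                # Note: alternation stops at the first alternative that
                # matches, so a word shared by several products is
                # attributed to the first of them only.
                for k, v in product_match.groupdict().items():
                    if v is None:
                        continue
                    matching_listing = product_re_map[k]
                    if matching_listing not in matching_listings:
                        matching_listings.append(matching_listing)
    print(listing['title'], matching_listings)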
ldrg