11

I am looking to parse unstructured product titles like “Canon D1000 4MP Camera 2X Zoom LCD” into structured data like {brand: canon, model number: d1000, lens: 4MP, zoom: 2X, display type: LCD}.

So far I have:

  1. Removed stopwords and cleaned up the text (removed characters like - ; : /).
  2. Tokenized long strings into words.

Any techniques/library/methods/algorithms would be much appreciated!

EDIT: There is no heuristic for the product titles. A seller can input anything as a title. For example, 'Canon D1000' alone can be the whole title. Also, this exercise is not only for camera datasets; the title can be for any product.

stealthspy
  • Do you have any training data? Say product specifications for 1000 products? – Jirka Aug 29 '13 at 16:05
  • I have a lot of training data. I need to perform this for 100 million items, but right now I am trying to build a prototype with ~10,000 products related to Cameras. – stealthspy Aug 29 '13 at 16:21
  • 1
    I am trying to solve the same problem. I have ~50K products, all of them unstructured, no training data. The first step for me is to find data for training, meaning products with defined attributes: brand, model etc. Products belong to electronics (phones, laptops, cameras). Any suggestions where to find products with attributes? – dzeno Dec 01 '14 at 10:13

5 Answers

7

Since you have a lot of training data (I assume you have many pairs of a title and a structured JSON specification), I would try to train a Named Entity Recognizer.

For example, you can train the Stanford NER. See this FAQ entry explaining how to do it. Obviously, you will have to fiddle with the parameters as product titles are not exactly sentences.

You will need to prepare the training data, but that should not be hard. You need two columns, word and answer, and you can add the tag column (though I am not sure how accurate a standard POS tagger would be on such atypical text). I would simply extract the value of the answer column from the associated JSON specification. There will be some ambiguity, but I think it will be rare enough that you can ignore it.
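A minimal sketch of that preparation step, assuming each training pair is a title string plus a flat dict of attribute to value. The function names and the `O` background label are my own choices for illustration; the tab-separated word/answer layout is, as far as I know, what the Stanford NER trainer reads with `map = word=0,answer=1` in its properties file, but check the FAQ before relying on it.

```python
def label_tokens(title, spec):
    """Pair each whitespace token of `title` with the attribute whose value contains it."""
    # Invert the spec: lower-cased value token -> label derived from the attribute name.
    # If two attributes share a token, the later one wins (the ambiguity is simply ignored).
    lookup = {}
    for attribute, value in spec.items():
        for piece in str(value).lower().split():
            lookup[piece] = attribute.upper().replace(" ", "_")
    # Tokens that match no attribute value get the background label "O".
    return [(tok, lookup.get(tok.lower(), "O")) for tok in title.split()]


def to_training_rows(title, spec):
    """Render one title as tab-separated word/answer lines, one token per line."""
    return "\n".join(f"{tok}\t{label}" for tok, label in label_tokens(title, spec))


# Hypothetical specification for the title from the question.
spec = {"brand": "canon", "model number": "d1000", "zoom": "2X", "display type": "LCD"}
print(to_training_rows("Canon D1000 4MP Camera 2X Zoom LCD", spec))
```

Tokens that are not a value of any attribute in the specification (here "4MP", "Camera" and "Zoom") end up with the background label, so the quality of the labels depends directly on how complete the specification is.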

Jirka
4

Having developed a commercial analyzer of this kind, I can tell you that there is no easy solution for this problem. But there are multiple shortcuts, especially if your domain is limited to cameras/electronics.

Firstly, you should look at more sites. Many have the product brand annotated in the page (proper HTML annotations, bold font, all caps at the beginning of the name). Some sites have entire pages with brand selectors for search purposes. This way you can create a pretty good starter dictionary of brand names. The same goes for product line names and even models. Alphanumeric models can be extracted in bulk by regular expressions and filtered pretty quickly.
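To give a rough idea of the dictionary-plus-regex shortcut, here is a sketch. The brand set, the list of unit suffixes and the function name are all made up for illustration; a real dictionary would be harvested from annotated pages as described above.

```python
import re

# Hypothetical starter dictionary of brand names.
KNOWN_BRANDS = {"canon", "nikon", "sony"}

# A candidate model token contains at least one letter and at least one digit,
# and consists only of letters, digits and hyphens, e.g. "D1000" or "EOS-5D".
MODEL_PATTERN = re.compile(r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+")

# Spec-like tokens such as "4MP" or "2X" also mix letters and digits, so they are
# filtered out. The unit list is an assumption and would need extending per domain.
SPEC_PATTERN = re.compile(r"\d+(\.\d+)?(mp|x|gb|mm)", re.IGNORECASE)


def extract_brand_and_models(title):
    """Return (brands, model candidates) found among the whitespace tokens of `title`."""
    tokens = title.split()
    brands = [t for t in tokens if t.lower() in KNOWN_BRANDS]
    models = [
        t for t in tokens
        if MODEL_PATTERN.fullmatch(t) and not SPEC_PATTERN.fullmatch(t)
    ]
    return brands, models


print(extract_brand_and_models("Canon D1000 4MP Camera 2X Zoom LCD"))
```

Because each token is checked on its own, the order in which the seller typed the words does not matter.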

There are plenty of other tricks, but I'll try to be brief. Just a piece of advice here: there is always a trade-off between manual work and algorithms. Always keep in mind that both approaches can be mixed and both have return-on-invested-time curves, which people tend to forget. If your goal is not to create an automatic algorithm to extract product brands and models, this problem should have a limited time budget in your plan. You can realistically create a dictionary of 1000 brands in a day, and for decent performance on a known data source of electronic goods (we are not talking about Amazon here, or are we?) a dictionary of 4000 brands may be all you need for your work. So do the math before you invest weeks into the latest neural network named entity recognizer.

Alex Nevidomsky
3

I agree there is no 100% success method. A possible approach would be to train a custom NER (Named Entity Recognition) model with some manually annotated data. The labels would be: BRAND/MODEL/TYPE. Another common way to filter out model names and brands is to use a dictionary, since brands and models are usually non-dictionary words.
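The dictionary filter could look roughly like this. The tiny word set is only a stand-in for a real English word list, and the function name is invented for the example.

```python
# Tiny stand-in for a real English word list; an assumption for illustration only.
ENGLISH_WORDS = {"camera", "zoom", "digital", "black", "with", "lens", "case"}


def non_dictionary_tokens(title, vocabulary=ENGLISH_WORDS):
    """Return the tokens absent from the vocabulary: candidates for BRAND or MODEL."""
    return [tok for tok in title.split() if tok.lower() not in vocabulary]


print(non_dictionary_tokens("Canon D1000 4MP Camera 2X Zoom LCD"))
```

Note that tokens like "4MP" and "LCD" also survive the filter, so this is only a first pass that narrows down what the annotators or the NER model have to look at.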

bogs
1

If you are only getting titles (like Amazon products), then you can view each title as a sentence and consider sequential labeling.

Depending on whether the attributes (such as brand and model) are given or unknown, there are two cases:

1: If the attributes are given, then the problem is "easy" and you can use any "sequential labeling" method. These include CRFs (conditional random fields) and Markov models (HMM, MEMM, etc.).

2: If not, then you need to extract (attribute, value) pairs in the same way as parsing (dependency parsing, full parsing). But I wonder whether this is feasible, since very little is known about the attributes beforehand. Another possibility is that, given lots of external information (reviews or product descriptions), you can figure out the attributes and then extract the pairs from the titles. For example, if you find a strong correlation between "brand" and "canon" in reviews, then on spotting the word "canon" in a title that also mentions a camera, you know it is a value for "brand".
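A very rough sketch of the correlation idea in case 2, using raw co-occurrence counts within a review sentence. The attribute vocabulary, the example reviews and the function names are all invented for illustration, and plain counts are a crude stand-in for a proper association measure.

```python
from collections import Counter

# Hypothetical attribute vocabulary; in case 2 this would itself have to be discovered.
ATTRIBUTE_WORDS = {"brand", "zoom", "display"}


def cooccurrence_counts(sentences, attribute_words=ATTRIBUTE_WORDS):
    """Count how often each word shares a sentence with each attribute word."""
    counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        for attribute in words & attribute_words:
            for word in words - attribute_words:
                counts[(attribute, word)] += 1
    return counts


def guess_attribute(token, counts, attribute_words=ATTRIBUTE_WORDS):
    """Pick the attribute that co-occurs most often with `token`, or None if never seen."""
    scored = {a: counts[(a, token.lower())] for a in sorted(attribute_words)}
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None


# Invented review sentences, already stripped of punctuation.
reviews = [
    "canon is a brand i trust",
    "the best brand is canon",
    "the zoom on this canon is great",
]
counts = cooccurrence_counts(reviews)
print(guess_attribute("Canon", counts))
```

Frequent function words ("is", "the") co-occur with everything, so in practice the counts would need normalising before they say much about rarer tokens.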

dragonxlwang
  • I think I need to mention that there is no heuristic for the product title. How would sequential labeling work in this case? Nothing stops the seller from entering "D1000 4MP Camera Canon 2X LCD Zoom" – stealthspy Aug 28 '13 at 20:30
  • 2
    Then this is a much harder problem (see case two). Leveraging reviews/descriptions would help. Otherwise, if you are only working with camera products (so the data is not sparse), then unsupervised sequential labeling (HMM) can probably help, but then you only learn that "canon" and "nikon" belong to the same attribute; it is still hard to name that attribute (where does "brand" come from?) – dragonxlwang Aug 28 '13 at 20:36
0

You might have more success with a neural net to parse such free text, but you will fail with just plain text parsing, because many of the words need a context you don't have.

However, depending on the level of precision you want to achieve, you can come up with a partial solution (which then requires human post-processing). Alternatively, force at least a minimum structure on the input (for example, product names must always follow a certain pattern). This gives you a much better start, since you can identify the product more reliably, which should give you enough context to understand the remaining input.

There's definitely no 100% solution possible (not even with a neural net), I guess.

Mike Lischke