I am looking to parse unstructured product titles like “Canon D1000 4MP Camera 2X Zoom LCD” into structured data like {brand: canon, model number: d1000, lens: 4MP, zoom: 2X, display type: LCD}.
So far I have:
- Removed stopwords and cleaned up the text (removed characters like `-`, `;`, `:`, `/`)
- Tokenized long strings into words
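
For reference, the preprocessing so far looks roughly like the sketch below. It assumes Python; the `STOPWORDS` set and the `preprocess` function name are placeholders for illustration, not the exact code in use.

```python
import re

# Hypothetical stopword list, for illustration only; a real list would be larger.
STOPWORDS = {"a", "an", "the", "for", "with", "and", "of"}

def preprocess(title):
    """Clean a raw product title and split it into lowercase tokens."""
    # Replace the separator characters listed above (- ; : /) with spaces.
    cleaned = re.sub(r"[-;:/]", " ", title)
    # Tokenize on whitespace and lowercase each token.
    tokens = [t.lower() for t in cleaned.split()]
    # Drop stopwords.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Canon D1000 4MP Camera 2X Zoom LCD"))
# ['canon', 'd1000', '4mp', 'camera', '2x', 'zoom', 'lcd']
```

This only yields a flat list of tokens; mapping each token to an attribute such as brand or model number is the part I am stuck on.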
Any techniques, libraries, methods, or algorithms would be much appreciated!
EDIT: There is no heuristic for the product titles. A seller can input anything as a title. For example, 'Canon D1000' alone can be the whole title. Also, this exercise is not limited to camera datasets; the title can be for any product.