Standardizing Ingredients for a Database

Question

I would love anyone's help on this, i've been placed on this project at work and I'm a bit lost on where to go and my manager doesn't know how to do this either.

Basically we have a list of 3000+ Hair products with all the ingredients. What I need to do is to create a database with ALL of these ingredients and standardize the names.

Here are the problems I face:

Ingredients are slightly spelt differently depending on country of production
Alot of the ingredients are in different formats, for example:

Product A: ingredient A,Ingredient B,Ingredient C
Product B: ingredient A, Ingredient B, Ingredient C
Product C: ingredient A ,Ingredient B ,Ingredient C
Product D: ingredient A.IngredientB.Ingredient C

How would I be able to:

1) Standardize all my existing data 2) Separate each ingredient so I have a list 3) Create a excel model that when I want to add an ingredient, it will verify it with my existing list? (This one I know how to do)

Thanks all! Would love some suggestions!

You could use Data --> Text to Columns, using comma as a delimiter which would break out the ingredients into one per column. Does that help get you started? Also, how do you define "standardize all my existing data"? What would make it standard, simply having the same delimiter? — BruceWayne, Jan 06 '17 at 08:42
By "standardize" I meant by having the ingredient lists in a consistent format and also have same ingredients with different spellings corrected. — Kelvin Chua, Jan 06 '17 at 09:00
Prior to doing the Text to Columns suggested by @BruceWayne, you might need to change all `.`s to `,`s (assuming there are no `.`s within the actual ingredient names) - that will get rid of one more source of difference. — YowE3K, Jan 06 '17 at 09:09
Delimiting first would be the way to go, then i'd sort in alphabetical order and you might be able to see by eye where the next ingredient starts then. Otherwise I'd use a formula checking for matches with LEFT() RIGHT() etc to ascertain matches. and sum up what they return. Perhaps delimit the info first and provide a small sample of data we can play around with for you? Remember to @ me so I get a notification. Are you trying to achieve a total of each ingredient? — Glitch_Doctor, Jan 06 '17 at 09:16

Ghislain Fourny · Answer 1 · 2017-01-06T09:30:45.450

From a data modeling perspective, there are two entities: products and ingredients, and they are in an N-to-M relationship. This can fit in tree relational tables, that you can store as three sheets in Excel*: a list of products, a list of ingredients, a list of relationships between products and ingredients. But that would be the final picture of course.

Now, getting these sheets populated (ETL) requires some work because of the low data quality (discrepancy in names). Part of it can probably be automated with sed scripts directly on the input as text (standardizing separators), or as BruceWayne indicates, using all possible delimiters in the CSV import. But part of it will probably require human intervention (fixing the different names).

Using CSV import, selecting delimiters and fixing spaces should probably get you there:

|---------|-------------|------------|-------------|
|Product A| ingredient A|Ingredient B|Ingredient C |
|Product B| ingredient A|Ingredient B| Ingredient C|
|Product C| ingredient A|Ingredient B| Ingredient C|
|Product D| ingredient A|IngredientB |Ingredient C |
|---------|-------------|------------|-------------|

What I would then suggest is to build a separate mapping of a standard name to all the other equivalent names that could occur in the input. This mapping may be partly built automatically (obvious case changes based on your observations, etc), but will probably require manual work and trial and error.

A mapping could look like

|-------------|-------------|
| From        | To          |
|-------------|-------------|
|ingredient a |Ingredient A | 
| ingredientA |Ingredient A |
| ingredient b|Ingredient B |
| IngredientB |Ingredient B |
|-------------|-------------|

Then, you should be able to map the original table to a standardized table with lookup functions:

|---------|------------|------------|------------|
|Product A|Ingredient A|Ingredient B|Ingredient C|
|Product B|Ingredient A|Ingredient B|Ingredient C|
|Product C|Ingredient A|Ingredient B|Ingredient C|
|Product D|Ingredient A|Ingredient B|Ingredient C|
|---------|------------|------------|------------|

And to map it to a list of ingredients (using duplicate elimination on the second column of the mapping table):

|------------|
|Ingredient A|
|Ingredient B|
|Ingredient C|
|------------|

(and the same for products).

You should also be able to populate the normalized relationship:

|---------|------------|
|Product A|Ingredient A|
|Product A|Ingredient B|
|Product A|Ingredient C|
|Product B|Ingredient A|
|Product B|Ingredient B|
|Product B|Ingredient C|
|Product C|Ingredient A|
|Product C|Ingredient B|
|Product C|Ingredient C|
|Product D|Ingredient A|
|Product D|Ingredient B|
|Product D|Ingredient C|
|---------|------------|

Disclaimer: from a database perspective, it is more advisable to use a relational database (possibly with Access) if you can than Excel, even though many do use Excel to store data. Wrong manipulations in Excel, such as deleting a cell and shifting rows and columns, can lead to severe errors in the data that can have disastrous consequences if it is used in production. Excel is great and easy to use even with no database background, but use it with care!

In the end it's all about BOMs http://stackoverflow.com/questions/17651424/bill-of-materials-database-model — Storax, Jan 06 '17 at 09:31

Standardizing Ingredients for a Database

1 Answers1