
I'm interested in classifying recipes programmatically based on a statistical analysis of various properties of the recipe. In other words, I want to classify a recipe as Breakfast, Lunch, Dinner or Dessert without any user input.

The properties I have available are:

  1. The recipe title (such as chicken salad)
  2. The recipe description (arbitrary text describing the recipe)
  3. The cooking method (the steps involved in preparing this recipe)
  4. Prep and cook times
  5. Each ingredient in the recipe, and its amount

The good news is I have a sample set of about 10,000 recipes that are already classified, and I can use these data to teach my algorithm. My idea is to look for patterns, such as if the word syrup appears statistically more frequently in breakfast recipes, or any recipe that calls for over 1 cup of sugar is 90% likely to be a dessert. I figure if I analyze the recipe across several dimensions, and then tweak the weights as appropriate, I can get something that's decently accurate.

What would be some good algorithms to investigate while approaching this problem? Would something like k-NN be helpful, or are there ones better suited to this task?

Has QUIT--Anony-Mousse
Mike Christensen
  • How much programming effort are you willing to put in? The easiest (least programming) solution is to concatenate all these fields into one big text and run any text classification tool. The second approach needs more involvement: you create your own features from the data and run one or more classification algorithms (SVM, Boosting, KNN, Neural Nets, Decision Tree and so on). – ElKamina Feb 13 '12 at 18:23
  • @ElKamina - I'm looking for the latter method involving building my own algorithm. Mainly what I'm wanting to get out of this question is pointers to algorithms that would be most suitable for this type of problem, I'm not looking for any sample code or anything (the question is obviously much too broad for that!) – Mike Christensen Feb 13 '12 at 18:26
  • Once you have the features, you can easily experiment with many different classification algorithms with [Weka](http://www.cs.waikato.ac.nz/ml/weka/) and choose the one that best fits your requirements. – Lars Kotthoff Feb 13 '12 at 18:46
  • @LarsKotthoff - This Weka project looks pretty awesome! I will for sure check it out, at the very least I can get my data in this format and test out some various algorithms quickly. Thanks for the pointer! – Mike Christensen Feb 13 '12 at 19:00

3 Answers


If I were to do it, I would do it as suggested by LiKao. I would first focus on the ingredients. I would establish a dictionary of the words appearing in the Ingredients sections of the recipes, and clean up the list in a supervised way to remove non-ingredient terms such as quantities and units.

Then I would resort to Bayes' theorem: your database allows you to compute the probability of having Eggs in a Breakfast, in a Dinner, and so on; you precompute those conditional probabilities. Then, given an unknown recipe containing both Eggs and Marmalade, you can compute the a posteriori probability that the meal is a Breakfast.

You can later enrich this with other terms and/or take quantities into account (number of Eggs per person)...
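As a minimal sketch of the Bayes computation described above, the following counts how often each ingredient occurs per meal type and turns those counts into a posterior for a new recipe. The function names (`train`, `posterior`), the toy recipes, and the add-one smoothing are assumptions made for illustration; they do not come from the asker's database. For brevity the sketch only scores ingredients that are present and ignores absent ones.

```python
import math
from collections import Counter, defaultdict

def train(recipes):
    """recipes: iterable of (set_of_ingredients, meal_label) pairs."""
    label_counts = Counter()
    ingredient_counts = defaultdict(Counter)
    vocabulary = set()
    for ingredients, label in recipes:
        label_counts[label] += 1
        for ingredient in ingredients:
            ingredient_counts[label][ingredient] += 1
            vocabulary.add(ingredient)
    return label_counts, ingredient_counts, vocabulary

def posterior(ingredients, model):
    """Return P(label | ingredients) for every label seen in training."""
    label_counts, ingredient_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior probability of the meal type
        for ingredient in ingredients:
            if ingredient not in vocabulary:
                continue  # skip ingredients never seen in training
            # Fraction of recipes with this label that contain the
            # ingredient, with add-one smoothing (an assumption here).
            p = (ingredient_counts[label][ingredient] + 1) / (n + 2)
            score += math.log(p)
        log_scores[label] = score
    # Normalize in log space to avoid underflow with many ingredients.
    top = max(log_scores.values())
    unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {l: v / z for l, v in unnormalized.items()}

# Hypothetical toy data, only to show the call pattern.
TOY_RECIPES = [
    ({"eggs", "marmalade", "bread"}, "Breakfast"),
    ({"eggs", "bacon"}, "Breakfast"),
    ({"chicken", "lettuce"}, "Dinner"),
    ({"eggs", "chicken", "rice"}, "Dinner"),
]

model = train(TOY_RECIPES)
probs = posterior({"eggs", "marmalade"}, model)
print(max(probs, key=probs.get), probs)
```

With real data, quantities could enter as extra discretized features (for example a hypothetical "sugar_over_1_cup" token) treated the same way as ingredients.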

  • Good suggestions - Luckily, my database is already normalized in this way so I have a set dictionary of ingredients, and amounts/units are stored separately. – Mike Christensen Feb 14 '12 at 00:25
  • If you are using Bayes' theorem like this, what exactly would be the difference from using a naive Bayes learner, either a self-implemented one or one of the many available ones? Except that you are ignoring the a priori probabilities of the meal types in your description, this seems to do just what a naive Bayes learner does. Still an upvote for the nice and short description of naive Bayes. – LiKao Feb 14 '12 at 09:33
  • @LiKao: no difference was intended. –  Feb 14 '12 at 11:30

Try various well-known machine learning algorithms. I would suggest first using a Bayesian classifier, since it is easy to implement and often works fairly well. If this does not work, then try something more complex, e.g. Neural Nets or SVMs.

The main problem will be deciding on a set of features as input into your method. For this you should look at which information is unique. For example, if you have a recipe titled "Chicken Salad", the "chicken" part will not be of much interest because it is also present in the ingredients and simpler to gather from there. So you should try to find a set of keywords that give new information (i.e. the "Salad" part). This can probably be automated somehow, but more likely you will be better off doing it by hand, since it only needs to be done once.
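One rough way to automate the "Chicken Salad" example is to drop every title word that already appears among the ingredients and keep the rest as candidate keywords. This is only a sketch: the helper name `informative_title_words` and the whitespace tokenization are assumptions, and a hand-curated keyword list may still work better.

```python
def informative_title_words(title, ingredients):
    """Return title words that are not already covered by the ingredient list."""
    ingredient_words = {
        word for ingredient in ingredients for word in ingredient.lower().split()
    }
    return [word for word in title.lower().split() if word not in ingredient_words]

# Hypothetical recipe, only to show the call pattern.
keywords = informative_title_words("Chicken Salad", ["chicken", "mayonnaise", "celery"])
print(keywords)
```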

The same goes for the description. Finding the correct set of features is always the hardest part for such a task.

Once you have your set of features, just train your algorithm on them and see how well it does. If you do not have much experience with machine learning, have a look at the different methods for correctly testing an ML algorithm (e.g. leave-N-out testing).
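To illustrate the kind of held-out testing mentioned above, here is a small k-fold cross-validation sketch: the data is split into k parts, and each part in turn is left out of training and used only for scoring. The names `k_fold_accuracy`, `train_fn`, and `predict_fn` are hypothetical, and the majority-class baseline is included only so the example runs on its own.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, train_fn, predict_fn, k=5, seed=0):
    """samples: list of (features, label). Returns mean accuracy over k folds."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    folds = [samples[i::k] for i in range(k)]  # k roughly equal parts
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(training)
        correct += sum(predict_fn(model, x) == y for x, y in held_out)
    return correct / len(samples)

# Trivial baseline: always predict the most common label in the training data.
def majority_train(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def majority_predict(model, features):
    return model

# Hypothetical toy data: every sample has the same label.
toy = [({"sugar"}, "Dessert") for _ in range(10)]
accuracy = k_fold_accuracy(toy, majority_train, majority_predict, k=5)
print(accuracy)
```

Comparing any real classifier against such a baseline on the same folds gives a quick sense of whether the chosen features carry useful information.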

LiKao

I think an NN is probably overkill for this. I would try classifying with a single perceptron "network" for each type of meal (Breakfast, Dinner, ...), letting it go over the input and adjust its weight vector. Every meaningful word found in the dataset can be an input to the network. I would expect that to be enough for your needs; I have used this method successfully to classify text before.
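A minimal sketch of the one-perceptron-per-meal-type idea, assuming each recipe has already been reduced to a set of meaningful words: each perceptron learns to answer "is this my meal type?" and the final class is the one with the highest activation. The function names, the toy samples, the fixed step size of 1, and the number of epochs are all assumptions made for illustration.

```python
from collections import defaultdict

def train_perceptrons(samples, labels, epochs=10):
    """samples: list of (set_of_words, label). One weight vector per label."""
    weights = {label: defaultdict(float) for label in labels}
    bias = {label: 0.0 for label in labels}
    for _ in range(epochs):
        for words, true_label in samples:
            for label in labels:
                activation = bias[label] + sum(weights[label][w] for w in words)
                predicted = activation > 0
                target = (label == true_label)
                if predicted != target:
                    # Classic perceptron update: move toward the target.
                    step = 1.0 if target else -1.0
                    bias[label] += step
                    for w in words:
                        weights[label][w] += step
    return weights, bias

def classify(words, model):
    """Pick the label whose perceptron has the highest activation."""
    weights, bias = model
    return max(
        weights,
        key=lambda label: bias[label] + sum(weights[label].get(w, 0.0) for w in words),
    )

# Hypothetical toy data, only to show the call pattern.
TOY_SAMPLES = [
    ({"pancake", "syrup"}, "Breakfast"),
    ({"omelette", "toast"}, "Breakfast"),
    ({"steak", "potatoes"}, "Dinner"),
    ({"roast", "chicken"}, "Dinner"),
]

perceptrons = train_perceptrons(TOY_SAMPLES, ["Breakfast", "Dinner"])
print(classify({"syrup", "toast"}, perceptrons))
```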

WeaselFox