I'm interested in classifying recipes programmatically based on a statistical analysis of various properties of the recipe. In other words, I want to classify a recipe as Breakfast, Lunch, Dinner or Dessert without any user input.
The properties I have available are:
- The recipe title (such as chicken salad)
- The recipe description (arbitrary text describing the recipe)
- The cooking method (the steps involved in preparing this recipe)
- Prep and cook times
- Each ingredient in the recipe, and its amount
The good news is I have a sample set of about 10,000 recipes that are already classified, and I can use these data to train my algorithm. My idea is to look for patterns: for example, the word "syrup" might appear statistically more often in breakfast recipes, or a recipe that calls for over 1 cup of sugar might be 90% likely to be a dessert. I figure if I analyze each recipe across several dimensions, and then tweak the weights as appropriate, I can get something that's decently accurate.
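To make the word-pattern idea concrete, here is a rough sketch of what I have in mind for the title text alone. The six sample recipes are made up for illustration (they stand in for the real 10,000), and the function names `word_class_counts` and `class_share` are just placeholders of mine, not anything from an existing library.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the ~10,000 labelled recipes.
SAMPLES = [
    ("pancakes with maple syrup", "Breakfast"),
    ("waffles with syrup and butter", "Breakfast"),
    ("chicken salad sandwich", "Lunch"),
    ("roast chicken with potatoes", "Dinner"),
    ("chocolate cake with sugar glaze", "Dessert"),
    ("sugar cookies", "Dessert"),
]

def word_class_counts(samples):
    """Count, for each word, how many recipes of each class contain it."""
    counts = defaultdict(Counter)
    for title, label in samples:
        # Use a set so a word repeated in one title is counted once.
        for word in set(title.lower().split()):
            counts[word][label] += 1
    return counts

def class_share(counts, word):
    """Fraction of the recipes containing `word` that fall in each class."""
    per_class = counts.get(word, Counter())
    total = sum(per_class.values())
    if total == 0:
        return {}  # word never seen in the training data
    return {label: n / total for label, n in per_class.items()}

counts = word_class_counts(SAMPLES)
print(class_share(counts, "syrup"))    # only seen in Breakfast titles here
print(class_share(counts, "chicken"))  # split between Lunch and Dinner here
```

With real data I would presumably do the same for the description, method and ingredient text, and handle the numeric properties (times, amounts) separately before combining everything with weights.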
What would be some good algorithms to investigate while approaching this problem? Would something like k-NN be helpful, or are there others better suited to this task?