
So I'm using this Kaggle dataset (https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions), and I need to convert the contents of the ingredient IDs field into actual lists rather than strings representing lists, so that I can explode the ingredients and then create a matrix of recipes by ingredients (see pictures below).

When I use the code `recipes_exploded["ingredient_ids"] = recipes_exploded['ingredient_ids'].apply(lambda x: ast.literal_eval(x))` in Dask, I get an error saying that I have supplied a custom function whose output type Dask cannot determine.
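
For context, this is roughly what the attempt looks like end to end in Dask; the `dask.dataframe.read_csv` load and the file name `PP_recipes.csv` are illustrative assumptions, not my actual setup; only the `apply` line is my actual code:

```
import ast
import dask.dataframe as dd

# Illustrative load step; the file name here is an assumption, not my actual path.
recipes = dd.read_csv("PP_recipes.csv")

recipes_exploded = recipes.copy()

# This is the line Dask complains about: it cannot infer the output type
# ("meta") of the custom lambda.
recipes_exploded["ingredient_ids"] = recipes_exploded["ingredient_ids"].apply(
    lambda x: ast.literal_eval(x)
)
```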

What would be alternative code for solving this issue? I found another post dealing with a similar issue (use ast.literal_eval with dask Series), and the comments there suggested that the dataset had simply been created poorly and that proper serialization should have been used. The problem with this is that it isn't my dataset; my primary coding language is R, and I'm trying to figure out how to remedy the issue.

This is what I want to make: [image: the desired recipe-by-ingredient matrix]

This is what I'm trying to fix:

[images: the raw data, with `ingredient_ids` stored as strings representing lists]

This is my code in Pandas to go from the second picture to the first:

import ast

recipes_exploded = recipes.copy(deep=True)

# Parse each stringified list so that every ingredient_id becomes its own observation and can be exploded.
recipes_exploded["ingredient_ids"] = recipes_exploded["ingredient_ids"].apply(lambda x: ast.literal_eval(x))
recipes_exploded = recipes_exploded.explode(column="ingredient_ids", ignore_index=True)

recipes_exploded = recipes_exploded[["recipe_id", "ingredient_ids"]]

# Flag each (recipe, ingredient) pair with a 1, then pivot into a recipe-by-ingredient matrix.
recipes_exploded["count"] = 1

recipes_exploded = recipes_exploded.pivot_table(index="recipe_id", columns="ingredient_ids",
                                                values="count", fill_value=0, aggfunc="sum")
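
To make the target concrete, here is a tiny self-contained pandas example with made-up recipe and ingredient IDs, showing what those steps produce: each row is a recipe, each column is an ingredient ID, and the cells are 0/1 counts.

```
import ast
import pandas as pd

# Made-up miniature version of the data: two recipes, three distinct ingredients.
toy = pd.DataFrame({
    "recipe_id": [10, 20],
    "ingredient_ids": ["[1, 2]", "[2, 3]"],
})

toy["ingredient_ids"] = toy["ingredient_ids"].apply(ast.literal_eval)
toy = toy.explode("ingredient_ids", ignore_index=True)
toy["count"] = 1

print(toy.pivot_table(index="recipe_id", columns="ingredient_ids",
                      values="count", fill_value=0, aggfunc="sum"))
# ingredient_ids  1  2  3
# recipe_id
# 10              1  1  0
# 20              0  1  1
```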

James
  • If it's always a list of numbers, you can use `json.loads()`. – Barmar May 24 '22 at 17:07
  • Awesome! ...so this is doing what I want in Python: ```.apply(lambda x: json.loads(x))```. If it works in Dask, I will be very happy! – James May 24 '22 at 17:18
  • Shoot! I'm still getting: "You have supplied a custom function and Dask is unable to determine the type of output that that function returns." So it must be the .apply(lambda...) that is the issue. Is there a workaround for that in Dask? – James May 24 '22 at 17:26
  • @James there is no non-custom function that is going to solve this problem. Looking at the dask docs, it seems like you can *specify* the output type manually, in this case, presumably, it will be `object` but I'm not familiar with dask – juanpa.arrivillaga May 24 '22 at 17:29
  • @James Try adding `meta=('techniques', object)` as a keyword argument to `DataFrame.apply` – user2246849 May 24 '22 at 17:33
  • Thanks so much! That worked: ```.apply(lambda x : json.loads(x), meta=('techniques', object))``` (a consolidated sketch follows these comments). It did seem to take much longer than in pandas/python (which seemed odd): is that just because of the partitions? – James May 24 '22 at 18:01
  • If your dataset is small enough to fit comfortably in memory, you probably shouldn't be using Dask. Any distributed processing engine adds non-trivial overhead, and while it does allow you to leverage multiple cores (or perhaps a distributed cluster) and to process larger-than-memory data iteratively, this is often not worth it for small tasks. – Michael Delgado May 24 '22 at 18:56
  • Lol...totally agree, alas, the professor feels otherwise. It is about 600,000 * 8000, so not small, but the project has to be completed in Dask or Pyspark. I also think part of it was to get comfortable dealing with partitioned data such that parallelization can occur. – James May 24 '22 at 21:09
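
Pulling the comment suggestions together, this is a sketch of what ended up working for me in Dask: `json.loads` to parse the stringified lists, plus an explicit `meta` so Dask knows the output type. The column name inside the `meta` tuple is just a label (the comment used `'techniques'`; I use `'ingredient_ids'` here), and the `explode` call assumes a Dask version recent enough to support `DataFrame.explode`:

```
import json
import dask.dataframe as dd

# Assumption: `recipes` is the Dask DataFrame loaded from the Kaggle CSV, with
# `ingredient_ids` stored as strings such as "[412, 23, 907]".
recipes_exploded = recipes.copy()

# json.loads parses each stringified list of numbers; meta tells Dask the name
# and dtype of the result, so it no longer complains about the custom function.
recipes_exploded["ingredient_ids"] = recipes_exploded["ingredient_ids"].apply(
    json.loads, meta=("ingredient_ids", "object")
)

# The remaining steps mirror the pandas version.
recipes_exploded = recipes_exploded.explode("ingredient_ids")
recipes_exploded["count"] = 1
```

The final pivot to the recipe-by-ingredient matrix would still need adapting, since Dask's `pivot_table` has its own requirements that the comments above did not cover.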
