-1

lets say i have 2 csv files (very large files),

  • the first file represents restaurants and have 6 attributes restaurant_id, name,star_rating,city,zone,closed

enter image description here

  • the second file represents the categories of the restaurants and have 2 attributes restaurant_id and category enter image description here

So, what i want to do is basically add a column called zone_categories_intersection to my features that tells me the number of restaurants in the same area (zone) that share at least one category with the restaurant in question.

Since it's the first time i use the pandas librairy, i have a little trouble getting fluent when manipulating tables. I did something like this to figure out the number of restaurants in the area associated with the restaurant in question and add it to my features column.


restaurants['nb_restaurants_zone'] = restaurants.groupby('zone')['zone'].transform('size')
restaurants.head()

features = restaurants[['restaurant_id', 'moyenne_etoiles', 'ville', 'zone', 'ferme', 'nb_restaurants_zone']].copy()
features.head()

#edit
merged = restaurants.merge(categories, on='restaurant_id')
merged.head()

enter image description here

I thought about adding the category.csv file and merge it with restaurant and map the categorys with the corresponding id's and then figure out a way to apply the second condition (that share at least one category with the restaurant in question)... but i dont really know how to do any of those things

Thank you

Lynn
  • 121
  • 8
  • 25
  • @RichieV what i want to do is basically add a column to my features that tells me the number of restaurants in the same area (zone) that share at least one category of food with the restaurant in question. I dont know how to create a table in here to show. you but it's pretty simple. – Lynn Sep 21 '20 at 19:25
  • @RichieV ohh those tables you see are just the Heads, they are much larger ^^ – Lynn Sep 21 '20 at 19:29
  • and please, always show the tables in a code block, leave the images for things that cannot be shared with text as charts and such – RichieV Sep 21 '20 at 19:30
  • @RichieV yes nb_restaurants_zone is already showing how many restaurants are in the same zone, but i want to know he number of restaurants in the same area (zone) that share at least one category of food with the restaurant in question. – Lynn Sep 21 '20 at 19:30
  • I see, please change the images for code blocks including `df.head()` and `df2[df2['restaurant_id'].isin(df.head()['restaurant_id'])]` – RichieV Sep 21 '20 at 19:40
  • @RichieV i dont know how to add tables in those code blocks – Lynn Sep 21 '20 at 19:42
  • copy-paste the table, select the pasted text, ctrl+k (or select "Code sample" from the mini-toolbar) – RichieV Sep 21 '20 at 19:44

1 Answers1

1

Try this

# sample data
# (it's not exactly your provided data
# but it is better to show how the code works)
# please always provide a callable line of code
# you could get it with `df.head().to_dict('split')`
rest = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Denny\'s', 'Ike\'s Love & Sandwiches', 'Midori Japanese',
        'Pho U', 'John & Sons Oysters'],
    'avg_stars': [2.5, 4, 3.5, 3.5, 4],
    'city': ['Las Vegas', 'Phoenix', 'Calgary', 'Toronto', 'Toronto'],
    'zone': ['a', 'a', 'b', 'b', 'a']
})
cats = pd.DataFrame([
    [1, ['Breakfast', 'Dinners', 'American']],
    [2, ['Sandwiches', 'American']],
    [3, ['Japanese']],
    [4, ['Japanese']],
    [5, ['American', 'Seafood']]
], columns=['id', 'category']).explode('category')

The code

# add zone to categories dataframe
cats2 = cats.merge(rest[['id', 'zone']], on='id')

# add count for zone & category
cats2['zone_cat_count'] = (
    cats2.groupby(['zone', 'category'])
    .transform('count')
)

# merge with rest dataframe
rest = rest.merge(
    cats2.groupby('id')['zone_cat_count'].max()
    , on='id'
)

Output

   id                     name  avg_stars       city zone  zone_cat_count
0   1                  Denny's        2.5  Las Vegas    a               3
1   2  Ike's Love & Sandwiches        4.0    Phoenix    a               3
2   3          Midori Japanese        3.5    Calgary    b               2
3   4                    Pho U        3.5    Toronto    b               2
4   5      John & Sons Oysters        4.0    Toronto    a               3
RichieV
  • 5,103
  • 2
  • 11
  • 24
  • Thanks you i tried the steps with my tables and it seems to work. Is it fine as well if i had added the categories to the rest dataframe (not added zone to categories like you did) and did the count there with .groupby('id')['zone_cat_count'].max(), on='id' ? Also, why did you use .max? – Lynn Sep 21 '20 at 20:26
  • For displaying tables correctly in here, so i just use `df.head().to_dict('split')` on what i want to share and copy the text directly on a code block here? – Lynn Sep 21 '20 at 20:28
  • 1
    that would give the first five rows from the dataframe, see [how to provide a relevant sample](https://stackoverflow.com/help/minimal-reproducible-example)... maybe the first 5 rows don't show the complexity of your data, in that case include more rows with `df.head(20)` or create a new toy dataframe – RichieV Sep 21 '20 at 20:31
  • 1
    for example the first 5 rows of the categories dataframe would not be enough for this sample, you would need all the relevant `restaurant_id` – RichieV Sep 21 '20 at 20:33
  • Is it fine as well if i had added the categories to the rest dataframe (not added zone to categories like you did) and did the count there with .groupby('id')['zone_cat_count'].max(), on='id' ? and, why did you use .max? – Lynn Sep 21 '20 at 20:39
  • when you group `cats2` by `id` you need an aggregation column, and in this case you are looking for the maximum number of shared categories... as for the zone stuff, if it is not what you need then I just don't understand your question, hope you can solve your problem with what I've provided, best of luck – RichieV Sep 21 '20 at 20:43