0

I was wondering how would find estimated values based on several different categories. Two of the columns are categorical, one of the other columns contains two strings of interest and the last contain numeric values I have a csv file called sports.csv

import pandas as pd
import numpy as np

#loading the data into data frame
df = pd.read_csv('sports.csv')

I'm trying to find a suggested price for a Gym that have both baseball and basketball as well as enrollment from 240 to 260 given they are from region 4 and of type 1

Region  Type    enroll  estimates   price   Gym
2   1   377 0.43    40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4   2   100 0.26    37  Baseball|Tennis
4   1   347 0.65    61  Basketball|Baseball|Ballet
4   1   264 0.17    12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1   1   286 0.74    78  Swimming|Basketball
0   1   210 0.13    29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0   1   263 0.91    31  Tennis
2   2   271 0.39    54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3   3   247 0.51    33  Baseball|Hockey|Swimming|Cycling
0   1   109 0.12    17  Football|Hockey|Volleyball

I don't know how to piece everything together. I apologize if the syntax is incorrect I'm just beginning Python. So far I have:

import pandas as pd
import numpy as np

#loading the data into data frame
df = pd.read_csv('sports.csv')

#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')

price = df.loc[df['Gym'], 'Price']

Should I do a groupby instead? If so, how would I include the columns Type==1 Region ==4 and enrollment from 240 to 260 ?

M3105
  • 519
  • 2
  • 7
  • 20
  • 1
    Possible duplicate of [pandas: filter rows of DataFrame with operator chaining](http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining) – Merlin Jul 31 '16 at 01:21
  • You asked the same question here http://stackoverflow.com/questions/38677504/finding-specific-strings-within-a-column-and-finding-the-max-corresponding-to-th My answer to that question could be used to answer this one., also.. READ ALL THE ANSWERS. – Merlin Jul 31 '16 at 03:30

2 Answers2

0

You can create a mask with all your conditions specified and then use the mask for subsetting:

mask = (df['Region'] == 4) & (df['Type'] == 1) & \
       (df['enroll'] <= 260) & (df['enroll'] >= 240) & \
        df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')

df['price'][mask]
# Series([], name: price, dtype: int64)

which returns empty, since there is no record satisfying all conditions as above.

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • Thank you for your response, sorry if this is a dumb question, but how do I print it's output? When I run print df['Price'][mask] it just gives me Series([], Name: Price, dtype: int64) – M3105 Jul 31 '16 at 01:49
  • That's because there is nothing in your data frame satisfy all the conditions, and you get empty results. If you have a larger data frame or you modify your conditions so that there are some records satisfying the conditions, you should be able to get the results printed. – Psidom Jul 31 '16 at 01:56
  • 1
    I tried adding more values to make it true but it still gave me Series([], Name: Price, dtype: int64) – M3105 Jul 31 '16 at 02:02
0

I had to add an instance that would actually meet your criteria, or else you will get an empty result. You want to use df.loc with conditions as follows:

In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region  Type    enroll  estimates   price   Gym
    ...: 2   1   377 0.43    40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
    ...: 4   2   100 0.26    37  Baseball|Tennis
    ...: 4   1   247 0.65    61  Basketball|Baseball|Ballet
    ...: 4   1   264 0.17    12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
    ...: 1   1   286 0.74    78  Swimming|Basketball
    ...: 0   1   210 0.13    29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
    ...: 0   1   263 0.91    31  Tennis
    ...: 2   2   271 0.39    54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
    ...: 3   3   247 0.51    33  Baseball|Hockey|Swimming|Cycling
    ...: 0   1   109 0.12    17  Football|Hockey|Volleyball""")

In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")

In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)") 
    ...:        & (df.enroll <= 260) & (df.enroll >= 240) 
    ...:        & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]: 
2    61
Name: price, dtype: int64

Note I used a regex pattern for contains that essentially acts as an AND operator for regex. You could simply have done another conjunction of .contains conditions for Basketball and Baseball.

juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172