
I have the following data:

import pandas as pd

data = {
    "index": [1, 2, 3, 4, 5],
    "name": ["A", "A", "B", "B", "B"],
    "type": ["s1", "s2", "s1", "s2", "s3"],
    "value": [20, 10, 18, 32, 25],
}
df = pd.DataFrame(data)

I need to check whether the values under the same name follow the constraint s1 < s2 < s3 (there are only three types, and not all of them exist under every name). That is, under the same name, if the value of s1 is smaller than that of s2 (or s3), return True; likewise, if the value of s2 is smaller than that of s3, return True. Otherwise, return False, or an empty cell/NaN when there is nothing left to compare against. Here is the output I expect:

    index   name    type    value   result
0     1      A       s1      20      False
1     2      A       s2      10        
2     3      B       s1      18      True
3     4      B       s2      32      False
4     5      B       s3      25        

How can I do it in Python? Thanks for your help.

ah bon
  • What have you already tried? Please add some code. – Naor Tedgi Dec 29 '18 at 07:11
  • Why are there dashes in some rows and `False`s in some other rows? What is the formula/algorithm for calculating each `result`? – DYZ Dec 29 '18 at 07:15
  • @DYZ For instance, under A there are only s1. It would return dash or NaN if you like. – ah bon Dec 29 '18 at 07:18
  • Your question is unclear. _for instance_ isn't good enough. What _exactly_ is the condition for `True`, `False`, and dash - for each of the outcomes separately? Once you have the formula, it is easy to code it. – DYZ Dec 29 '18 at 07:19
  • @DYZ A s1 return False because s1 is not smaller than s2 in example. Same reason for B s2. – ah bon Dec 29 '18 at 07:20
  • Ok, what about `True` and a dash? Also, can you have more than one `s1`, `s2` or `s3` per name? – DYZ Dec 29 '18 at 07:21
  • @DYZ group by `name`, now say there are three type `s1, s2 and s3`, then check if `s1` is less than `s2` if yes state `True` else `False`, now take next pair, `s2` and `s3`. Do same with this pair. Since we don't have any next pair to form with `s3` hence `-` – meW Dec 29 '18 at 07:23
  • @meW That's just an educated guess. I'd rather know what the OP has in mind. – DYZ Dec 29 '18 at 07:25
  • Sorry @DYZ, the formula for calculating each result is s1 < s2 < s3, and I have only one s1, s2 or s3 per name. – ah bon Dec 29 '18 at 07:52

1 Answer


Try:

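# Use pd.Categorical so that sorting follows s1 < s2 < s3 even when the labels are not in lexicographical order.
df['type'] = pd.Categorical(df['type'], ordered=True, categories=['s1','s2','s3'])

# Within each name, subtract the next type's value from the current one (NaN for the last type).
df['result'] = df.sort_values('type').groupby('name')['value'].diff(-1)

# A negative difference means the current value is smaller than the next one; blank out rows with nothing to compare.
df['result'] = df['result'].lt(0).mask(df['result'].isna(),'')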

df

Output:

   index name type  value result
0      1    A   s1     20  False
1      2    A   s2     10       
2      3    B   s1     18   True
3      4    B   s2     32  False
4      5    B   s3     25       
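
For reference, the same idea can be written as a self-contained function. This is only a sketch: the name `check_order` and its `order` parameter are made up for illustration, it assumes the DataFrame index is unique and each type appears at most once per name, and it leaves NaN (rather than an empty string) where there is no next type to compare against.

```python
import pandas as pd

def check_order(df, order=("s1", "s2", "s3")):
    """Hypothetical helper: flag whether each value is smaller than the
    value of the next type present under the same name."""
    out = df.copy()
    # Integer rank of each type according to `order` (s1 -> 0, s2 -> 1, ...).
    rank = pd.Categorical(out["type"], categories=list(order), ordered=True).codes
    ordered = out.assign(_rank=rank).sort_values(["name", "_rank"])
    # Value of the next type within the same name; NaN for the last one.
    nxt = ordered.groupby("name")["value"].shift(-1)
    # Compare, then replace rows without a next value by NaN.
    result = (ordered["value"] < nxt).astype(object).where(nxt.notna())
    out["result"] = result  # assignment aligns on the original index
    return out

df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "name": ["A", "A", "B", "B", "B"],
    "type": ["s1", "s2", "s1", "s2", "s3"],
    "value": [20, 10, 18, 32, 25],
})
checked = check_order(df)
```

Keeping NaN instead of `''` leaves the column easy to filter with `isna()` or `dropna()` later on; if a blank cell is preferred for display, `checked["result"].fillna("")` should give the same look as the output above.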
Scott Boston
  • Why `sort_values('type')`? The answer still comes out the same without it. Am I missing something? – meW Dec 29 '18 at 11:36
  • If the dataframe isn't sorted by type, then diff will not work correctly. diff(-1) takes the current row and subtracts the next row regardless of sort. So, to get diff to perform as expected with s1 – Scott Boston Dec 29 '18 at 11:40
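
The point about sorting can be checked on a small example. The sketch below uses made-up rows for name B stored in the order s3, s1, s2 (an assumption for illustration, not the data from the question) and compares `diff(-1)` with and without `sort_values('type')`.

```python
import pandas as pd

# Rows for name B stored out of type order: s3, s1, s2.
df = pd.DataFrame({
    "name": ["B", "B", "B"],
    "type": ["s3", "s1", "s2"],
    "value": [25, 18, 32],
})
df["type"] = pd.Categorical(df["type"], categories=["s1", "s2", "s3"], ordered=True)

# Without sorting, each row is compared with whatever row happens to follow it.
unsorted_diff = df.groupby("name")["value"].diff(-1)

# With sorting, each row is compared with the next type (s1 -> s2 -> s3).
sorted_diff = df.sort_values("type").groupby("name")["value"].diff(-1).sort_index()

print(pd.DataFrame({"type": df["type"],
                    "unsorted": unsorted_diff,
                    "sorted": sorted_diff}))
```

Without the sort, the s3 row is compared against s1 and the s2 row has nothing to compare with; with the sort, s1 is compared against s2, s2 against s3, and s3 is left as NaN. When the rows already happen to be in type order within each name, as in the question's data, both versions agree, which is why the result looked the same without `sort_values`.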