23

I have some values in the risk column that are neither, Small, Medium or High. I want to delete the rows with the value not being Small, Medium and High. I tried the following:

df = df[(df.risk == "Small") | (df.risk == "Medium") | (df.risk == "High")]

But this returns an empty DataFrame. How can I filter them correctly?

ArtDijk
  • 1,957
  • 6
  • 23
  • 31
  • 1
    I've tried to create a dataframe with such data, and your string of code works properly. Could you give more information about what contains in dataframe and how do you generate it? – Mikhail Elizarev Apr 27 '14 at 14:39
  • You requirement is a little unclear, if all your values can ever be small, mediu, or high and you want to drop rows that are any of these values then this will result in now rows so could you explain clearer what you require – EdChum Apr 27 '14 at 14:49
  • Hmm.. your code is correct so I think you need to post data and code that reproduces your problem – EdChum Apr 27 '14 at 15:04
  • For example, it would be useful to see what `df.risk.value_counts()` returns. – DSM Apr 27 '14 at 15:17
  • @EdChum. Your previous (now deleted post) had df = df[df.risk.isin(['Small','Medium','High'])]. That gave the desired result ! – ArtDijk Apr 27 '14 at 16:40
  • Before the advise by EdChum:In [154]: df.risk.value_counts() Out[154]: Small 23520 Medium 6164 High 402 None 251 After code from EdChum the 'None' was gone. :-) – ArtDijk Apr 27 '14 at 16:41
  • @ArtDijk OK I will undelete and edit my answer – EdChum Apr 27 '14 at 16:51
  • Could you post some data though and also as @DSM suggests the output from df.risk.value_counts() – EdChum Apr 27 '14 at 16:53
  • Do you want to just filter out 'None' values? – EdChum Apr 27 '14 at 17:09
  • @ EdChum. I want to filter out all none Small, Medium or High values. – ArtDijk Apr 27 '14 at 18:43
  • @ArtDijk well your code sample should've worked and it worked on my toy dataset, whilst my alternative suggestion works for you it'd be better to post some actual data if you can and the output from `df.risk.value_counts()` – EdChum Apr 27 '14 at 19:20

3 Answers3

34

I think you want:

df = df[(df.risk.isin(["Small","Medium","High"]))]

Example:

In [5]:
import pandas as pd
df = pd.DataFrame({'risk':['Small','High','Medium','Negligible', 'Very High']})
df

Out[5]:

         risk
0       Small
1        High
2      Medium
3  Negligible
4   Very High

[5 rows x 1 columns]

In [6]:

df[df.risk.isin(['Small','Medium','High'])]

Out[6]:

     risk
0   Small
1    High
2  Medium

[3 rows x 1 columns]
EdChum
  • 376,765
  • 198
  • 813
  • 562
4

Another nice and readable approach is the following:

small_risk = df["risk"] == "Small"
medium_risk = df["risk"] == "Medium"
high_risk = df["risk"] == "High"

Then you can use it like this:

df[small_risk | medium_risk | high_risk]

or

df[small_risk & medium_risk]
Rafael
  • 7,002
  • 5
  • 43
  • 52
1

You could also use query:

df.query('risk in ["Small","Medium","High"]')

You can refer to variables in the environment by prefixing them with @. For example:

lst = ["Small","Medium","High"]
df.query("risk in @lst")

If the column name is multiple words, e.g. "risk factor", you can refer to it by surrounding it with backticks ` `:

df.query('`risk factor` in @lst')

query method comes in handy if you need to chain multiple conditions. For example, the outcome of the following filter:

df[df['risk factor'].isin(lst) & (df['value']**2 > 2) & (df['value']**2 < 5)]

can be derived using the following expression:

df.query('`risk factor` in @lst and 2 < value**2 < 5')