
I want to find only the rows and only the columns with field values within a given set of values.

I can get the rows, but I can't limit the columns.

Say I have this dataframe:

print(df)

# year     1970  1971  1972  1973  1974  1975  1976  1977  1978
# country  
# Malawi    NaN   NaN   NaN   123   NaN   234   NaN   NaN   NaN
# OtherC    NaN   NaN   NaN   124   NaN   234   NaN   NaN   NaN
# OtherD    NaN   NaN   NaN   124   NaN   235   NaN   NaN   NaN

What I want to return is the rows and columns containing 123 or 234:

# year     1973  1975
# country  
# Malawi    123   234
# OtherC    124   234

I can do this, returning only the rows with the given values, but not selecting the columns:

print(df[df.isin([123, 234]).any(axis=1)])

# year     1970  1971  1972  1973  1974  1975  1976  1977  1978
# country  
# Malawi    NaN   NaN   NaN   123   NaN   234   NaN   NaN   NaN
# OtherC    NaN   NaN   NaN   124   NaN   234   NaN   NaN   NaN

But, when I try either of these two statements, I get an error:

print(df[df.isin([123, 234]).any(axis=1)]\
[df.isin([123, 234]).any(axis=0)])

print(df[df.isin([123, 234]).any(axis=0)]\
[df.isin([123, 234]).any(axis=1)])

...
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Or, this (with the actual dataset):

print(df[[df.isin([123, 234]).any(axis=1)],\
[df.isin([123, 234]).any(axis=0)]])

# The following output is from using the real dataset, so it
# includes more rows and columns, and the sought-after values
# are in different locations, but you get the idea:

# TypeError: '([country
# Afghanistan            False
# Albania                False
# Andorra                False
# ...
# Vietnam                False
# Zimbabwe               False
# Length: 160, dtype: bool], [year
# 1970    False
# ...
# 1998    False
# 1999     True
# 2000     True
# 2001    False
# ...
# 2018    False
# dtype: bool])' is an invalid key
Kaleb Coberly

1 Answer


Use `any(axis=1)` to get the row mask and `any(axis=0)` to get the column mask, then pass both to `.loc`. Also, avoid chained indexing (e.g. `df[...][...]`) when possible:

valids = df.isin([123, 234])
df.loc[valids.any(axis=1), valids.any(axis=0)]

Output:

year     1973  1975
country            
Malawi    123   234
OtherC    124   234
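For completeness, here is a self-contained sketch that rebuilds the example frame from the question (the constructor call is my own reconstruction, not the asker's actual data) and applies the two masks in one `.loc` call:

```python
import pandas as pd

# Rebuild the example frame (a hypothetical reconstruction of the asker's data).
df = pd.DataFrame(
    {1973: [123, 124, 124], 1975: [234, 234, 235]},
    index=pd.Index(["Malawi", "OtherC", "OtherD"], name="country"),
).reindex(columns=range(1970, 1979))  # years 1970-1978, NaN elsewhere
df.columns.name = "year"

valids = df.isin([123, 234])
# Row mask: rows containing at least one sought value;
# column mask: columns containing at least one sought value.
result = df.loc[valids.any(axis=1), valids.any(axis=0)]
print(result)
```

`isin` treats NaN as False, so the all-NaN columns drop out of the column mask automatically.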
Quang Hoang
  • Thanks! You wouldn't believe how many different ways I tried to get this! – Kaleb Coberly Oct 29 '20 at 19:55
  • I'm curious why to avoid index chaining, Quang. – Kaleb Coberly Oct 30 '20 at 01:29
  • 1
    See [this question](https://stackoverflow.com/questions/41253170/whats-the-alternative-to-pandas-chain-indexing) and the answer under it. Plus, with index chaining, you might end up with a copy of the data, and modifying that might fail and raise a SettingwithCopyWarnjng – Quang Hoang Oct 30 '20 at 01:35
  • Thanks, I'll check it out. When Python does and does not make copies is definitely on my radar right now. Seems kind of arbitrary. – Kaleb Coberly Oct 30 '20 at 02:13