0

If I have a pd.DataFrame that looks like:

new_df = []
for i in range(10):
    df_example = pd.DataFrame(np.random.normal(size=[10,1]))
    cols = [round(np.random.uniform(low=0,high=10)),round(np.random.uniform(low=0,high=10)),
        round(np.random.uniform(low=0,high=10)),round(np.random.uniform(low=0,high=10))]
    keys = ['A','B','C','D']
    new_ix = pd.MultiIndex.from_tuples([cols],names=keys)
    df_example.columns = new_ix
    new_df.append(df_example)
new_df = pd.concat(new_df,axis=1)

Which could yield something like: enter image description here

Now, if I want where C=4 and A=1 I can do:

df.xs(axis=1,level=['A','C'],key=[1,4])

How do I express if I want:

  1. C in [4,2] and A in [5,2]
  2. C in [4,2] or A in [5,2]
jason m
  • 6,519
  • 20
  • 69
  • 122

1 Answers1

0

To the best of my knowledge, you can't use anything but tuples for key parameter in xs, so such queries are not possible.

The next best thing is to define helper functions for that purpose, such as the following:

def xs_or(df: pd.DataFrame, params: dict[str, list[int]]) -> pd.DataFrame:
    """Helper function.

    Args:
        df: input dataframe.
        params: columns/values to query.

    Returns:
        Filtered dataframe.

    """
    df = pd.concat(
        [
            df.xs(axis=1, level=[level], key=(key,))
            for level, keys in params.items()
            for key in keys
        ],
        axis=1,
    )
    for level in params.keys():
        try:
            df = df.droplevel([level], axis=1)
        except KeyError:
            pass
    return df


def xs_and(df: pd.DataFrame, params: dict[str, list[int]]) -> pd.DataFrame:
    """Helper function.

    Args:
        df: input dataframe.
        params: columns/values to query.

    Returns:
        Filtered dataframe.

    """
    for level, keys in params.items():
        df = xs_or(df, {level: keys})
    return df

And so, with the following dataframe named df:

A         4         7         3         1         7         9         4         0         3         9
B         6         7         4         6         7         5         8         0         8         0
C        2         10        5         2         9         9         4         3         4         5 
D        0         1         7         3         8         3         6         7         9         10
0 -0.199458  1.155345  1.298027  0.575606  0.785291 -1.126484  0.019082  1.765094  0.034631 -0.243635
1  1.173873  0.523277 -0.709546  1.378983  0.266661  1.626118  1.647584 -0.228162 -1.708271  0.111583
2  0.321156  0.049470 -0.611111 -1.238887  1.092369  0.019503 -0.473618  1.804474 -0.850320 -0.217921
3  0.339307 -0.758909  0.072159  1.636119 -0.541920 -0.160791 -1.131100  1.081766 -0.530082 -0.546489
4 -1.523110 -0.662232 -0.434115  1.698073  0.568690  0.836359 -0.833581  0.230585  0.166119  1.085600
5  0.020645 -1.379587 -0.608083 -1.455928  1.855402  1.714663 -0.739409  1.270043  1.650138 -0.718430
6  1.280583 -1.317288  0.899278 -0.032213 -0.347234  2.543415  0.272228 -0.664116 -1.404851 -0.517939
7 -1.201619  0.724669 -0.705984  0.533725  0.820124  0.651339  0.363214  0.727381 -0.282170  0.651201
8  1.829209  0.049628  0.655277 -0.237327 -0.007662  1.849530  0.095479  0.295623 -0.856162 -0.350407
9 -0.690613  1.419008 -0.791556  0.180751 -0.648182  0.240589 -0.247574 -1.947492 -1.010009  1.549234

You can filter like this:

# C in [10, 2] or A in [1, 0]
print(xs_or(df, {"C": [10, 2], "A": [1, 0]}))
# Output
B         7         6                   2         3
D         1         0         3         3         7
0  1.155345 -0.199458  0.575606  0.575606  1.765094
1  0.523277  1.173873  1.378983  1.378983 -0.228162
2  0.049470  0.321156 -1.238887 -1.238887  1.804474
3 -0.758909  0.339307  1.636119  1.636119  1.081766
4 -0.662232 -1.523110  1.698073  1.698073  0.230585
5 -1.379587  0.020645 -1.455928 -1.455928  1.270043
6 -1.317288  1.280583 -0.032213 -0.032213 -0.664116
7  0.724669 -1.201619  0.533725  0.533725  0.727381
8  0.049628  1.829209 -0.237327 -0.237327  0.295623
9  1.419008 -0.690613  0.180751  0.180751 -1.947492
# C in [10, 2] and A in [1, 7]
print(xs_and(df, {"C": [10, 2], "A": [1, 7]}))
# Output
B         6         7
D         3         1
0  0.575606  1.155345
1  1.378983  0.523277
2 -1.238887  0.049470
3  1.636119 -0.758909
4  1.698073 -0.662232
5 -1.455928 -1.379587
6 -0.032213 -1.317288
7  0.533725  0.724669
8 -0.237327  0.049628
9  0.180751  1.419008
Laurent
  • 12,287
  • 7
  • 21
  • 37
  • https://stackoverflow.com/questions/51032985/pandas-conditional-select-using-loc-with-multiindex This answer seemed promising but haven’t tested. – jason m Aug 14 '22 at 19:04