Filter columns by values in a row in Pandas

Question

I have obtained the statistics for my dataframe by df.describe() in Pandas.

statistics = df.describe()

I want to filter the statistics dataframe base on count:

    main    Meas1     Meas2 Meas3   Meas4  Meas5
    sublvl  Value     Value Value   Value   Value       
    count   7.000000  1.0   1.0     582.00  97.000000       
    mean    30        37.0  26.0    33.03   16.635350

I want to get something like that: filter out all Values with count less than 30 and show me only the columns with count >30 in a new dataframe (or give me a list with all main that have count>30).

For the above example, I want:

    main       Meas4    Meas5
    sublvl     Value    Value       
    count      582.00   97.000000       
    mean       33.03    16.635350

and [Meas4, Meas5]

I have tried

thresh = statistics.columns[statistics['count']>30]

And variations thereof.

`df[df.columns[(df['count'] > 30).values]]` should work. If not, please provide some code to reproduce exactly the dataframe you have. — cs95, Aug 21 '17 at 14:53
@cᴏʟᴅsᴘᴇᴇᴅ this throws a KeyError. The dataframe I have is made by df.describe() from a two level df with first level Meas1,2,3 etc. and second level "Value" — Moiraine24, Aug 21 '17 at 15:01
`[column for column in statistics.columns if statistics.loc['count'][column] > 3]` This is from the answer i have shared. I tried it and it works. please check. — Vikash Singh, Aug 21 '17 at 15:06
@VikashSingh how can I get only the first level of the multilevel in the list? so from my example only Meas1, Meas2, without the Value subindex? — Moiraine24, Aug 21 '17 at 15:11
you can simply do this `[column[0] for column in statistics.columns if statistics.loc['count'][column] > 3]`. this will chose the first level of the column. — Vikash Singh, Aug 21 '17 at 15:13

Vikash Singh · Accepted Answer · 2017-08-21T15:14:51.270

1

import pandas as pd

df = pd.DataFrame.from_dict({'name':[1,2,3,4,5], 'val':[1, None,None,None,None]})

df

name    val
0   1   1.0
1   2   NaN
2   3   NaN
3   4   NaN
4   5   NaN

if you want to use describe() then note that describe does not give all columns. only columns with numerical data types are returned by default:

you can do so in this way:

statistics = df.describe()

# to describe all columns you can do this
statistics = df.describe(include = 'all')

[column for column in statistics.columns if statistics.loc['count'][column] > 3]
# output ['name']

As discussed in comments, As this is a MultiIndex column to chose only first index we can do this:

# [column[0] for column in statistics.columns if statistics.loc['count'][column] > 3] # this code won't work correctly for non multi index dataframes.

for each column check if count is > threshold and add it to chosen_columns list:

chosen_columns = []
for column in df.columns:
    if len(df[column].value_counts()) > 3:
        chosen_columns.append(column)

# chosen_columns output: ['name']

OR:

chosen_columns = []
for column in df.columns:
    if df[column].count() > 3:
        chosen_columns.append(column)

# chosen_columns output: ['name']

edited Aug 21 '17 at 15:14

answered Aug 21 '17 at 14:49

Vikash Singh

13,213
8
40
70

I want to use the resulting statistics dataframe (it actually plays in my favor that statistics uses only numerical value columns) and perform the filter on it. – Moiraine24 Aug 21 '17 at 15:02
@Moiraine24 not a problem. I have added answer for all the options. – Vikash Singh Aug 21 '17 at 15:03
This list however gives me both indeces as tuples,[('Meas1', 'Value'), ('Meas2', 'Value'), ('Meas3', 'Value'),... – Moiraine24 Aug 21 '17 at 15:08
@Moiraine24 you can use a simple list comprehension to filter the 'Meas's out. `[v[0] for v in values]`. Can you share your original dataframe so I can figure out why this is happening? – Vikash Singh Aug 21 '17 at 15:11

score 1 · Answer 2 · answered Sep 30 '20 at 03:30

1

Selecting value by column is much more easy in pandas, so you can do transpose first, select as you wishes, and transpose back.

statistics.T[statistics.T['count']>30].T

answered Sep 30 '20 at 03:30

HuangHudson

11
1
3

Filter columns by values in a row in Pandas

2 Answers2