2

I have obtained the statistics for my dataframe by df.describe() in Pandas.

statistics = df.describe()

I want to filter the statistics dataframe base on count:

    main    Meas1     Meas2 Meas3   Meas4  Meas5
    sublvl  Value     Value Value   Value   Value       
    count   7.000000  1.0   1.0     582.00  97.000000       
    mean    30        37.0  26.0    33.03   16.635350

I want to get something like that: filter out all Values with count less than 30 and show me only the columns with count >30 in a new dataframe (or give me a list with all main that have count>30).

For the above example, I want:

    main       Meas4    Meas5
    sublvl     Value    Value       
    count      582.00   97.000000       
    mean       33.03    16.635350

and [Meas4, Meas5]

I have tried

thresh = statistics.columns[statistics['count']>30]

And variations thereof.

cs95
  • 379,657
  • 97
  • 704
  • 746
Moiraine24
  • 369
  • 1
  • 5
  • 16
  • `df[df.columns[(df['count'] > 30).values]]` should work. If not, please provide some code to reproduce exactly the dataframe you have. – cs95 Aug 21 '17 at 14:53
  • 1
    @cᴏʟᴅsᴘᴇᴇᴅ this throws a KeyError. The dataframe I have is made by df.describe() from a two level df with first level Meas1,2,3 etc. and second level "Value" – Moiraine24 Aug 21 '17 at 15:01
  • And what is the index? – cs95 Aug 21 '17 at 15:02
  • The standard describe() index:: {count, mean, std...} etc – Moiraine24 Aug 21 '17 at 15:05
  • `[column for column in statistics.columns if statistics.loc['count'][column] > 3]` This is from the answer i have shared. I tried it and it works. please check. – Vikash Singh Aug 21 '17 at 15:06
  • @VikashSingh how can I get only the first level of the multilevel in the list? so from my example only Meas1, Meas2, without the Value subindex? – Moiraine24 Aug 21 '17 at 15:11
  • you can simply do this `[column[0] for column in statistics.columns if statistics.loc['count'][column] > 3]`. this will chose the first level of the column. – Vikash Singh Aug 21 '17 at 15:13
  • @Moiraine24 Are you still looking for an answer? – cs95 Aug 21 '17 at 15:24

2 Answers2

1
import pandas as pd

df = pd.DataFrame.from_dict({'name':[1,2,3,4,5], 'val':[1, None,None,None,None]})

df

name    val
0   1   1.0
1   2   NaN
2   3   NaN
3   4   NaN
4   5   NaN

if you want to use describe() then note that describe does not give all columns. only columns with numerical data types are returned by default:

you can do so in this way:

statistics = df.describe()

# to describe all columns you can do this
statistics = df.describe(include = 'all')

[column for column in statistics.columns if statistics.loc['count'][column] > 3]
# output ['name']

As discussed in comments, As this is a MultiIndex column to chose only first index we can do this:

# [column[0] for column in statistics.columns if statistics.loc['count'][column] > 3] # this code won't work correctly for non multi index dataframes.

for each column check if count is > threshold and add it to chosen_columns list:

chosen_columns = []
for column in df.columns:
    if len(df[column].value_counts()) > 3:
        chosen_columns.append(column)

# chosen_columns output: ['name']

OR:

chosen_columns = []
for column in df.columns:
    if df[column].count() > 3:
        chosen_columns.append(column)

# chosen_columns output: ['name']
Vikash Singh
  • 13,213
  • 8
  • 40
  • 70
  • I want to use the resulting statistics dataframe (it actually plays in my favor that statistics uses only numerical value columns) and perform the filter on it. – Moiraine24 Aug 21 '17 at 15:02
  • @Moiraine24 not a problem. I have added answer for all the options. – Vikash Singh Aug 21 '17 at 15:03
  • This list however gives me both indeces as tuples,[('Meas1', 'Value'), ('Meas2', 'Value'), ('Meas3', 'Value'),... – Moiraine24 Aug 21 '17 at 15:08
  • @Moiraine24 you can use a simple list comprehension to filter the 'Meas's out. `[v[0] for v in values]`. Can you share your original dataframe so I can figure out why this is happening? – Vikash Singh Aug 21 '17 at 15:11
1

Selecting value by column is much more easy in pandas, so you can do transpose first, select as you wishes, and transpose back.

statistics.T[statistics.T['count']>30].T
HuangHudson
  • 11
  • 1
  • 3