-1

I have a dataframe (df) with populations for columns, instances as rows, and frequencies as entries (see attached screenshot, there about 2.3M rows and 40 columns). enter image description here

I have a way of sorting populations by their geographical distance. What I would like to do is take the row mean for the top four closest populations that do not have frequency as NA.

If there weren't any NAs, I would do something like:

closest = get_four_closest_pops(focalpop)  # get a list of the four closest pops
whatiwant = df[closest].mean(axis=1)  # doing it this way does not take into account NAs

If I loop through rows it will take way longer than I want it to, even if I use ipyparallel. The pandas.DataFrame.mean() method is quite quick. I was wondering if there wasn't a simple method where I could supply a list of all pops ordered by proximity to a focalpop and then take the four closest non-NAs.

example

For instance, I would like a pd.Series returned (or a dict) that has the rownames and the mean of the first four pops that are not NA.

for row jcf7190000000000-77738 in the screenshot (if for instance columns are ordered by proximity to a pop that is not shown) I would want output as 0.666625 (ie from DF_p1, DF_p24, DF_p25, DF_p26)

for row jcf7190000000000-77764 I would want output as 0.771275 (ie from DF_p1, DF_p24, DF_p25, DF_p26)

for row jcf7190000000004-54418 I would want output as 0.28651 (ie from DF_1, DF_2, DF_23, DF_24)

BML
  • 191
  • 2
  • 12
  • Could you give an example of a part of the dataset and your desired output? – MrNobody33 Jun 29 '20 at 17:10
  • I have updated the question – BML Jun 29 '20 at 17:11
  • But could you give and example with a part of that dataset and the desired output with that part? So someone can work with it and to make your question clearer – MrNobody33 Jun 29 '20 at 17:17
  • The screenshot is the dataset. check question for example output. – BML Jun 29 '20 at 17:19
  • Yeah, but there isn't a minimal, reproducible example. Please, see [how to ask](https://stackoverflow.com/help/how-to-ask) and [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). – MrNobody33 Jun 29 '20 at 18:18

1 Answers1

0

For future questions, see how to ask and How to create a Minimal, Reproducible Example (just a suggestion). Now, with a dummy df, since you didn't add a sample of your dataframe to work with, and taking into account that you have ordered by proximity the columns, you could try this:

import pandas as pd
from math import isnan
from statistics import mean

#creation of dummy df
df = pd.DataFrame({'df_1': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
                    'df_2': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'] ,
                    'df_3': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
                    'df_4': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
                    'df_5': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
                    'df_6': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
                    'df_7': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100'],
                    'df_8': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
                    'df_9': ['DDD','150','350','400','5000','500','XYZ','1200','DDD','150','350'],
                    'df_10': ['700','ABC','500','XYZ','1200','DDD','150','350','400','5000', '100']
                    })
df=df.apply(pd.to_numeric, errors='coerce')
df.index.name='instances'



#approach of a solution to your problem
def func(row):
    meanx=mean([float(x) for x in list(row)[1:] if not isnan(x)][:4])
    columns=[df.columns[i] for i,x in enumerate(list(row)[1:]) if not isnan(x)][:4]
    return {'columns':columns, 'mean':meanx}
dc={row[0]:func(row) for row in df.to_records()}

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
print(dc)

Output:

df
             df_1    df_2    df_3    df_4    df_5    df_6    df_7    df_8    df_9   df_10
instances
0           700.0     NaN   700.0     NaN   700.0     NaN   700.0     NaN     NaN   700.0
1             NaN   150.0     NaN   150.0     NaN   150.0     NaN   150.0   150.0     NaN
2           500.0   350.0   500.0   350.0   500.0   350.0   500.0   350.0   350.0   500.0
3             NaN   400.0     NaN   400.0     NaN   400.0     NaN   400.0   400.0     NaN
4          1200.0  5000.0  1200.0  5000.0  1200.0  5000.0  1200.0  5000.0  5000.0  1200.0
5             NaN   500.0     NaN   500.0     NaN   500.0     NaN   500.0   500.0     NaN
6           150.0     NaN   150.0     NaN   150.0     NaN   150.0     NaN     NaN   150.0
7           350.0  1200.0   350.0  1200.0   350.0  1200.0   350.0  1200.0  1200.0   350.0
8           400.0     NaN   400.0     NaN   400.0     NaN   400.0     NaN     NaN   400.0
9          5000.0   150.0  5000.0   150.0  5000.0   150.0  5000.0   150.0   150.0  5000.0
10          100.0   350.0   100.0   350.0   100.0   350.0   100.0   350.0   350.0   100.0

dc
{0: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 700.0}, 1: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 150.0}, 2: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 425.0}, 3: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 400.0}, 4: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 3100.0}, 5: {'columns': ['df_2', 'df_4', 'df_6', 'df_8'], 'mean': 500.0}, 6: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 150.0}, 7: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 775.0}, 8: {'columns': ['df_1', 'df_3', 'df_5', 'df_7'], 'mean': 400.0}, 9: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 2575.0}, 10: {'columns': ['df_1', 'df_2', 'df_3', 'df_4'], 'mean': 225.0}}

Note: It should be clarified that the program is designed assuming that all instances have at least 4 non-nan values.

MrNobody33
  • 6,413
  • 7
  • 19