How to find average of specific column entries based on similarity in their rows?

Question

I have a text file with 9 columns and many rows (around 30k). Some rows share the same values in their first five columns. In that case I want to collapse them into a single row in which each of columns 6-8 holds the mean of the corresponding entries. If a row is unique, I want to print it as it is. My original file looks like this:

6nbn    A   18  49  A   1.82270650408   2.03219831709   1.82706048066   1
6nbn    A   45  98  A   1.82498684927   2.03457366541   1.82271363631   1
6nbn    A   88  107 A   1.82115046056   2.03480564182   1.82785940378   1
6nbn    A   18  49  A   1.81906074665   2.03189099117   1.82705062875   2
6nbn    A   45  98  A   1.82562290739   2.03479384705   1.82313137212   2
6nbn    A   88  107 A   1.82279510642   2.03515331118   1.82660203657   2
6nbn    A   18  49  A   1.82147248126   2.03104332795   1.82474573571   3
6nbn    A   45  98  A   1.82470216748   2.03683136268   1.82329893325   3
6nbn    A   88  107 A   1.82258525178   2.0307116979    1.8247273769    3
8tfv    A   11  18  A   1.81042122171   2.01948136906   1.80238314462   1
8tfv    A   11  18  A   1.80688488842   2.02074367499   1.8064168954    2
8tfv    A   11  18  A   1.80874790947   2.02178955384   1.80609219034   3
8tfv    A   11  18  A   1.80850988385   2.01873277082   1.80290765155   4
8tfv    A   11  18  A   1.80312229203   2.01855121312   1.80927195302   5
8t11    B   1   4   A   1.80874790947   2.02178955384   1.80609219034   1

And I want my output file like this:

6nbn    A   18  49  A   1.82107991066   2.03171087874   1.82628561504   
6nbn    A   45  98  A   1.82510397471   2.03539962505   1.82304798056   
6nbn    A   88  107 A   1.82217693958   2.03355688363   1.82639627242   
8tfv    A   11  18  A   1.80753723909   2.01985971637   1.80541436699   
8t11    B   1   4   A   1.80874790947   2.02178955384   1.80609219034

I am a novice in Python programming. It would be a great help if you could help me solve this problem.

You need to use `pandas` with its `groupby` option. – Zaraki Kenpachi, Mar 13 '20 at 08:18

score 0 · Answer 1 · answered Mar 13 '20 at 08:28

Try this (replace the numbers with your column names):

df.groupby(['0','1','2','3','4'])[['5','6','7']].mean()

                        5         6         7
0    1 2  3   4                              
6nbn A 18 49  A  1.821080  2.031711  1.826286
       45 98  A  1.825104  2.035400  1.823048
       88 107 A  1.822177  2.033557  1.826396
8t11 B 1  4   A  1.808748  2.021790  1.806092
8tfv A 11 18  A  1.807537  2.019860  1.805414
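Note that the result above is sorted by the group keys, so `8t11` comes before `8tfv`, while the desired output in the question keeps the groups in file order. A minimal sketch of one way to keep that order, assuming the same placeholder column names `'0'` to `'8'` and using only a few rows from the question as sample data:

```python
import pandas as pd

# A few rows from the question; the column names "0".."8" are placeholders.
rows = [
    ["6nbn", "A", 18, 49, "A", 1.82270650408, 2.03219831709, 1.82706048066, 1],
    ["6nbn", "A", 18, 49, "A", 1.81906074665, 2.03189099117, 1.82705062875, 2],
    ["8tfv", "A", 11, 18, "A", 1.81042122171, 2.01948136906, 1.80238314462, 1],
    ["8tfv", "A", 11, 18, "A", 1.80688488842, 2.02074367499, 1.8064168954, 2],
    ["8t11", "B", 1, 4, "A", 1.80874790947, 2.02178955384, 1.80609219034, 1],
]
df = pd.DataFrame(rows, columns=[str(i) for i in range(9)])

keys = ["0", "1", "2", "3", "4"]    # columns that identify a group
values = ["5", "6", "7"]            # columns to average

# sort=False keeps the groups in order of first appearance;
# as_index=False returns the key columns as ordinary columns.
result = df.groupby(keys, sort=False, as_index=False)[values].mean()
print(result)
```

With `as_index=False` the result is a flat table with one row per group, which is closer to the layout asked for than the multi-level index shown above.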

answered Mar 13 '20 at 08:28 by luigigi

Dear luigigi, can you please tell me why I am getting this error with your code? `KeyError: '0'` – Abhijit Rana, Mar 15 '20 at 04:15
Like I said, you have to replace the numbers with your column names. – luigigi, Mar 16 '20 at 08:34
Yes, it worked. But what is the reason for replacing them with names? – Abhijit Rana, Mar 19 '20 at 11:32
I just used numbers as names because I don't know what your column names are. – luigigi, Mar 19 '20 at 12:13
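To illustrate the `KeyError: '0'` exchange: `groupby` looks the keys up among the DataFrame's column labels, and those labels depend on how the file was read. The sketch below assumes the file was loaded with `header=None` (how the asker actually read it is not stated), in which case pandas labels the columns with the integers 0 to 8, so the string `'0'` is not found.

```python
import pandas as pd
from io import StringIO

# Three rows from the question, used here as stand-in data.
text = """6nbn A 18 49 A 1.82270650408 2.03219831709 1.82706048066 1
6nbn A 18 49 A 1.81906074665 2.03189099117 1.82705062875 2
8t11 B 1 4 A 1.80874790947 2.02178955384 1.80609219034 1
"""

# header=None: no header row in the file, so columns get integer labels.
df = pd.read_csv(StringIO(text), sep=r"\s+", header=None)
print(list(df.columns))

# String keys do not match the integer labels, so the lookup fails.
try:
    df.groupby(["0", "1", "2", "3", "4"])
except KeyError as err:
    print("KeyError:", err)

# Using the labels that actually exist works.
means = df.groupby([0, 1, 2, 3, 4])[[5, 6, 7]].mean()
print(means)
```

Printing `df.columns` first is a quick way to see which names to put into the `groupby` call.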

score 0 · Answer 2 · answered Mar 13 '20 at 08:29

import pandas as pd
from io import StringIO


data = StringIO("""
6nbn A 18 49 A 1.82270650408 2.03219831709 1.82706048066 1
6nbn A 18 49 A 1.81906074665 2.03189099117 1.82705062875 2
6nbn A 45 98 A 1.82562290739 2.03479384705 1.82313137212 2
""")

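# sep=r'\s+' splits on any run of spaces or tabs, as in the original file
df = pd.read_csv(data, sep=r'\s+', names=['a','b','c','d','e','f','g','h','i'])
# column 'i' (the 9th) is averaged as well; select [['f','g','h']] to leave it out
result = df.groupby(['a','b','c','d','e']).agg('mean')
print(result)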

Output:

                       f         g         h    i
a    b c  d  e                                   
6nbn A 18 49 A  1.820884  2.032045  1.827056  1.5
       45 98 A  1.825623  2.034794  1.823131  2.0

With this code I am getting the following error: `TypeError: initial_value must be unicode or None, not str` – Abhijit Rana, Mar 15 '20 at 03:32
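That message is the one Python 2's `io.StringIO` gives when it is passed a plain `str`, so the snippet was probably run under Python 2. `StringIO` is only there to make the example self-contained, though; with a real file you can pass the path straight to `read_csv`. A minimal sketch, assuming a whitespace-separated file without a header row; the file names `input.txt` and `averaged.txt` are made up, and a temporary copy of a few rows from the question stands in for the real file:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real file: a few rows from the question.
sample = (
    "6nbn    A   18  49  A   1.82270650408   2.03219831709   1.82706048066   1\n"
    "6nbn    A   18  49  A   1.81906074665   2.03189099117   1.82705062875   2\n"
    "8tfv    A   11  18  A   1.81042122171   2.01948136906   1.80238314462   1\n"
    "8tfv    A   11  18  A   1.80688488842   2.02074367499   1.8064168954    2\n"
    "8t11    B   1   4   A   1.80874790947   2.02178955384   1.80609219034   1\n"
)
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")      # hypothetical input name
out_path = os.path.join(workdir, "averaged.txt")  # hypothetical output name
with open(in_path, "w") as fh:
    fh.write(sample)

# Read directly from the path; no StringIO needed.
names = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
df = pd.read_csv(in_path, sep=r"\s+", names=names)

# Average only columns 6-8 (f, g, h) and keep groups in file order.
result = df.groupby(["a", "b", "c", "d", "e"], sort=False, as_index=False)[
    ["f", "g", "h"]
].mean()

# Write one tab-separated row per group, without header or index.
result.to_csv(out_path, sep="\t", header=False, index=False)
print(result)
```

Unlike `agg('mean')` on the whole frame, this leaves the 9th column out of the result, which matches the output requested in the question.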
