13

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56          1            1          0        1       0           0   
1  11          0            0          0        0       1           0   
2   6          0            0          1        0       0           1   
3  43          0            0          0        1       0           1   
4  14          0            1          0        0       1           0   

   Histo      Economics-51      Literature-re         Literatureu4  
0           1            0           1                0  
1           0            0           0                1  
2           0            0           0                0  
3           0            1           1                0  
4           1            0           0                0  

My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":

    Id  Economics      English    History  Literature  
0  56          1            1          2        1                     
1  11          1            0          0        1                    
2   6          0            1          1        0                     
3  43          2            0          1        1                     
4  14          0            1          1        0          

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both gives the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is either: how can I debug this error or is there another solution for my problem. Notice that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.

gustafbstrom
  • 1,622
  • 4
  • 25
  • 44
Amanda
  • 835
  • 2
  • 9
  • 17

3 Answers3

14

I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.

Consider the following:

df = pd.DataFrame({
        'a_a': [1, 2, 3, 4],
        'a_b': [2, 3, 4, 5],
        'b_a': [1, 2, 3, 4],
        'b_b': [2, 3, 4, 5],
    })

Now

[s.split('_')[0] for s in df.T.index.values]

is the prefix of the columns. So

>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
    a   b
0   3   3
1   5   5
2   7   7
3   9   9

does what you want.

In your case, make sure to split using the '-' character.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • 4
    `groupby` accepts an axis=1 argument, so something like `df.groupby(df.columns.str.split("-").str[0],axis=1).sum()` might work. (That `Histo` column makes me wonder if we'd have to use a different grouping condition, but you get the idea.) – DSM Mar 02 '16 at 12:24
  • 1
    Adding to this great answer that one can also incorporate RegEx here in case the column names include multiple `_` characters: `import re` then `[re.split('_\d{2}', s)[0] for s in df.T.index.values]` – ksbawpn Sep 07 '21 at 16:07
12

You can use these to create sum of columns starting with specific name,

df['Economics']= df[list(df.filter(regex='Economics'))].sum(axis=1)
Raghul
  • 121
  • 1
  • 5
  • 2
    This is a much more elegant solution. Thanks for contributing to this 5 year old post – MJK Apr 22 '21 at 18:23
4

Using brilliant DSM's idea:

from __future__ import print_function

import pandas as pd

categories = set(['Economics', 'English', 'Histo', 'Literature'])

def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]    

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')

#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())

Output:

    Economics  English  Histo  Literature
Id
56          1        1      2           1
11          1        0      0           1
6           1        1      0           0
43          2        0      1           1
14          1        1      1           0

Here is another version, which takes care of "Histo/History" problematic..

from __future__ import print_function

import pandas as pd

#categories = set(['Economics', 'English', 'Histo', 'Literature'])

#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]

df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())

rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])

Output:

    Economics  English  History  Literature
Id
56          1        1        2           1
11          1        0        0           1
6           1        1        0           0
43          2        0        1           1
14          1        1        1           0
History
 Id
56    2
11    0
6     0
43    1
14    1
Name: History, dtype: int64

PS You may want to add missing categories to categories map/dictionary

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Thanks for your answer, awesome solution, but I get an issue with accessing the columns : with your methods I have always keyError, even after trying different solutions of encoding while reading CSV. Do even when I try to take `df['History']` I have an error, but no error with `df.get('history')` – Amanda Mar 02 '16 at 15:36
  • @Amanda, one of your columns had 'Histo' as a name, so i decided to use "Histo" as a common name, otherwise you would have both: Histo and History. So try df['Histo'] instead – MaxU - stand with Ukraine Mar 02 '16 at 15:47
  • @Amanda, what would you like to do with Histo column? Do you want to have both "Histo" and "History" in the result data frame? – MaxU - stand with Ukraine Mar 02 '16 at 15:56
  • thank you for your answer, actually the problem is not only in history, but in accessin the column name in general : http://stackoverflow.com/questions/35764172/accessing-the-column-in-pandas-in-different-way ! your answer is good, but when there a multiple dataframes that should be treated, the KeyError blocks everything – Amanda Mar 03 '16 at 05:48