
I'm working on the Kaggle Outbrain competition, and all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.

On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.

import pandas as pd
main_folder = r'...filepath\data_location' + '\\'
test = pd.read_csv(main_folder + 'documents_categories.csv\documents_categories.csv',nrows=1000)

def find_max(row,the_df,groupby_col,value_col,target_col):
    return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]

test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))

This gives me the error:

KeyError: ('document_id', 'occurred at index document_id')

Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner?

user133248
  • pass `axis=1`: `test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)` – EdChum Oct 10 '16 at 14:37
  • Thanks @EdChum, that fix led me to a second problem with an index mismatch that I was able to solve by myself. I'm also new to Stack Overflow, so I'm not familiar with etiquette/how to give you credit for nudging me in the right direction. Appreciate the help! – user133248 Oct 10 '16 at 14:49
  • Basically the issue here is that `apply` on a df works column-wise by default (`axis=0`); your function is expecting a row, so you need to pass `axis=1`. See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html – EdChum Oct 10 '16 at 14:51
  • @EdChum I think your comment should be posted as the answer for the sake of clarity – OriolAbril Apr 10 '18 at 18:12
  • If `x` is your row, do you need to pass `x[index_value]` to access the value? In my case it was: `df.apply(lambda x: func(x[0], x[1]), axis=1)`. I was applying a custom function to the first and second columns and I wanted to run it across all rows. – Nigel D Dec 14 '22 at 15:57

2 Answers


As answered by EdChum in the comments, the issue is that `apply` works column-wise by default (see the docs), so the column labels cannot be accessed from inside the function.

To apply the function to each row instead, `axis=1` must be passed:

test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'), axis=1)
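
As a minimal illustration of the difference (using a small made-up frame, not the competition data): with the default axis=0 the function receives each column as a Series indexed by the row labels, so looking it up by a column name raises the KeyError from the question; with axis=1 it receives each row, indexed by the column names.

import pandas as pd

toy = pd.DataFrame({'document_id': [1, 1, 2],
                    'category_id': [10, 11, 12],
                    'confidence_level': [0.9, 0.1, 0.8]})

# axis=0 (default): each x is a *column*, indexed by 0, 1, 2,
# so x['document_id'] raises KeyError.
# toy.apply(lambda x: x['document_id'])

# axis=1: each x is a *row*, indexed by the column names.
print(toy.apply(lambda x: x['document_id'], axis=1))
# 0    1
# 1    1
# 2    2
# dtype: int64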
OriolAbril
  • 7,315
  • 4
  • 29
  • 40
  • I get `TypeError: <lambda>() got an unexpected keyword argument 'axis'` – Aaron Bramson Aug 07 '18 at 08:17
  • It looks like you have misplaced some parenthesis or a comma, because it assumes that `axis` is a parameter of the lambda instead of being a parameter of `apply` – OriolAbril Aug 08 '18 at 12:51
  • There are 2 versions of apply. One works on Series, the other works on DataFrames. Only DataFrame's apply function has "axis" as a keyword. You must be calling apply on a Series, which has only one axis and thus no "axis" argument – ach-agarwal Oct 13 '20 at 22:43
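
To illustrate the last two comments with a small made-up frame: only DataFrame.apply has an axis parameter; Series.apply forwards extra keyword arguments to the function itself, which is one way the "unexpected keyword argument" error can appear. A rough sketch:

import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# DataFrame.apply accepts axis; with axis=1 each x is a row Series.
print(toy.apply(lambda x: x['a'] + x['b'], axis=1))

# Series.apply has no axis parameter; extra keyword arguments are passed
# on to the function, so the lambda receives axis=1 and this raises
# "TypeError: <lambda>() got an unexpected keyword argument 'axis'".
# toy['a'].apply(lambda x: x + 1, axis=1)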

Why axis=1?

To expand on Oriol's answer: test is a DataFrame, and some of the parameters passed to find_max() ('document_id', 'confidence_level' and 'category_id') are column labels, so the function needs to be applied to each row. To do that, axis=1 should be passed.

I still get KeyError: 0. What gives?

For the given dataset, even after including axis=1, a KeyError: 0 is raised. The reason is that idxmax() is called on the entire column (the_df[value_col]), so it returns the index label of the maximum confidence_level in the whole dataframe, but that label is then looked up in a slice of the dataframe (the_df[the_df[groupby_col]==row[groupby_col]]). In short, the slice doesn't contain the label 0.

If we debug the code a bit, by printing what the slice looks like:

def find_max(row,the_df,groupby_col,value_col,target_col):
    x = the_df[the_df[groupby_col]==row[groupby_col]]  # rows sharing this row's document_id
    idx = the_df[value_col].idxmax()                   # idxmax over the whole column, not the slice
    print('slice:\n', x, end='\n\n')
    print('index:', idx)
    return x.loc[idx][target_col]
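
and calling it the same way as before (with axis=1 as suggested in the comments); the assignment below never completes because the error is raised partway through:

test['max_cat'] = test.apply(
    lambda x: find_max(x, test, 'document_id', 'confidence_level', 'category_id'),
    axis=1)  # prints each slice, then raises KeyError: 0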

it outputs

slice:
    document_id  category_id  confidence_level
2      1524246         1807              0.92
3      1524246         1608              0.07

index: 0

As you can see, this slice's index labels are [2, 3], yet idx=0, so when x.loc[idx] is evaluated, the KeyError is raised since the slice has no label 0.
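
For completeness, a minimal fix to the row-wise version (a sketch keeping the original signature) is to call idxmax() on the slice's column rather than on the whole column. It no longer raises, but it still re-filters the entire dataframe for every row, which is slow:

def find_max(row, the_df, groupby_col, value_col, target_col):
    # keep only the rows sharing this row's group value (e.g. document_id)
    group = the_df[the_df[groupby_col] == row[groupby_col]]
    # idxmax on the slice's own column, so the label is guaranteed to exist
    return group.loc[group[value_col].idxmax(), target_col]

test['max_cat'] = test.apply(
    lambda x: find_max(x, test, 'document_id', 'confidence_level', 'category_id'),
    axis=1)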

How to perform the filtering correctly and more efficiently

To answer OP's original request

how to achieve my goal in a more efficient manner?

Since the goal is to

return the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.

it can be done by computing idxmax() per group and broadcasting it back to each row with groupby.transform:

df['max_cat'] = df.loc[df.groupby('document_id')['confidence_level'].transform('idxmax'), 'category_id'].tolist()
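
As a quick sanity check on a small made-up frame (the 1524246 rows mirror the slice printed above; the other document is invented for illustration):

import pandas as pd

df = pd.DataFrame({'document_id':      [1524246, 1524246, 1111111, 1111111],
                   'category_id':      [1807, 1608, 2000, 2001],
                   'confidence_level': [0.92, 0.07, 0.30, 0.65]})

# per-group idxmax, broadcast back to every row of the group
df['max_cat'] = df.loc[df.groupby('document_id')['confidence_level']
                         .transform('idxmax'), 'category_id'].tolist()
print(df)
#    document_id  category_id  confidence_level  max_cat
# 0      1524246         1807              0.92     1807
# 1      1524246         1608              0.07     1807
# 2      1111111         2000              0.30     2001
# 3      1111111         2001              0.65     2001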

For the given input, the first 6 rows of the output look as follows:

[screenshot of the resulting dataframe with the added max_cat column]

cottontail