I'm working on the Kaggle Outbrain click prediction competition; all datasets referenced in my code can be found at https://www.kaggle.com/c/outbrain-click-prediction/data.
On to the problem: I have a dataframe with columns ['document_id', 'category_id', 'confidence_level']. I would like to add a fourth column, 'max_cat', that returns the 'category_id' value that corresponds to the greatest 'confidence_level' value for the row's 'document_id'.
import pandas as pd
main_folder = r'...filepath\data_location' + '\\'
test = pd.read_csv(main_folder + 'documents_categories.csv\documents_categories.csv',nrows=1000)
def find_max(row,the_df,groupby_col,value_col,target_col):
    return the_df[the_df[groupby_col]==row[groupby_col]].loc[the_df[value_col].idxmax()][target_col]
test['max_cat'] = test.apply(lambda x: find_max(x,test,'document_id','confidence_level','category_id'))
This gives me the error:
KeyError: ('document_id', 'occurred at index document_id')
Can anyone help explain either why this error occurred, or how to achieve my goal in a more efficient manner?
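A likely cause of the error: `DataFrame.apply` defaults to `axis=0`, so the lambda receives each column as a Series rather than each row. The lookup `row['document_id']` then fails on the first column, which matches the "occurred at index document_id" part of the message. Passing `axis=1` should remove the KeyError, but note that `the_df[value_col].idxmax()` is computed over the whole frame rather than the filtered group, so the function would probably still not return the intended value. Below is a minimal sketch of a grouped alternative, assuming ties can be resolved by taking the first maximum. The small frame `df` and its values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Made-up stand-in for documents_categories.csv (illustrative values only).
df = pd.DataFrame({
    'document_id': [1, 1, 2, 2, 2],
    'category_id': [10, 11, 20, 21, 22],
    'confidence_level': [0.2, 0.9, 0.5, 0.1, 0.7],
})

# Row label of the highest confidence_level within each document_id.
# The result is a Series indexed by document_id.
best_idx = df.groupby('document_id')['confidence_level'].idxmax()

# category_id found at those row labels, re-keyed by document_id.
best_cat = pd.Series(
    df.loc[best_idx, 'category_id'].to_numpy(),
    index=best_idx.index,
)

# Broadcast the winning category back to every row of the same document.
df['max_cat'] = df['document_id'].map(best_cat)
print(df)
```

This avoids the row-by-row `apply` and does one grouped pass instead, which should scale better on the full file.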