1

enter image description here

I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that has the the column header and matching cell value. Some thing that looks like "Column||Value." I have the search term and the join aggregator. I can get the row index name but I'm struggling getting the matching column and specific cell. Here's me code so far

df = pd.read_excel (r"Test_file")

mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)

df['extract'] = df.loc[mask] #This only give me the index name. I would like the actual matched cell contents.

df['extract2'] = Column name

df['Match'] = df[['extract', 'extract2']].agg('||'.join.axis=1)

df.drop(['extract', 'extract2'], axis=1)

Final output should look something like Output

ryda
  • 38
  • 5

2 Answers2

0

You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally, change the values to your desired format ("Column||Value") for rows that are returned after applying the mask. I implemented this in the following sample code:

def match_column(df, column_name):
    column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name]
    df['Match'] = 'No Match'
    df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

df = match_column(df, 'Segment')
display(df)

Output:

enter image description here

However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).

UPDATE:

If you want to use a list of columns as input and match with the first instance, you can use this instead:

def match_first_column(df, column_list): 
    df['Match'] = 'No Match'
    # iterate over rows
    for index, row in df.iterrows():
        # iterate over column names
        for column_name in column_list:
            column_value = row[column_name]
            substrings = ['Chann', 'Midm', 'Fran']
            # if a match is found
            if any(x in column_value for x in substrings):
                # add match string
                df.loc[index, 'Match'] = column_name + ' || ' + column_value
                # stop iterating and move to next row
                break
    return df

df = {
    'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
    'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)

column_list= df.columns.tolist() 
match_first_column(df, column_list)

Output:

enter image description here

jollibobert
  • 333
  • 1
  • 8
  • 1
    Thank you. This is great but I am attempting to search the entire dataframe and get the first match instance. I attempted to edit your code by making column_name into a list of columns but I am getting errors: 'can only concatenate list (not "str") to list' – ryda Oct 06 '21 at 19:51
  • def match_column(df, column_name): column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name] df['Match'] = 'No Match' df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name] return df column_name = df.columns.tolist() match_column(df, column_name) – ryda Oct 06 '21 at 19:53
  • I added a new edited function for a list of columns and getting the first match. See if it works for you. :) – jollibobert Oct 07 '21 at 07:57
  • Wow! Thank you. This is exactly what I was trying to do. And thanks for the explanation – ryda Oct 10 '21 at 09:01
0

You can try:

mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)
df.loc[mask, 'Match'] = '||'.join(df[['extract', 'extract2']])
df['Match'].fillna('No Match', inplace=True)
Muhammad Hassan
  • 4,079
  • 1
  • 13
  • 27