-1

I oversimplified this greatly - the actual column will contain up to 500 characters and the substring list will have 60 values, anywhere from 10-80 characters

dataframe is more complicated than this, list would contain 60 values, DF would have 11,000 rows, but here is what I am trying to do

I have a dataframe like so, and a list

my_list = ['alabama 500', 'beta 15', 'carthouse', 'd320 blend']

df = pd.DataFrame({'col1':['left side alabama 500 on the right side carthouse', '1st entry is at beta 15', 'this one takes a mix of d320 blend and beta 15']})


    col1
0   left side alabama 500 on the right side carthouse
1   1st entry is at beta 15
2   this one takes a mix of d320 blend and beta 15


I am attempting to write a function that would return this, keep the first column intact and return the substring in a new column with the original column intact

df['col2'] 
    col1                                                  col2
0   left side alabama 500 on the right side carthouse     alabama 500
1   left side alabama 500 on the right side carthouse     carthouse
2   1st entry is at beta 15                               beta 15
3   this one takes a mix of d320 blend and beta 15        beta 15
4   this one takes a mix of d320 blend and beta 15        d320 blend

here's what I have tried

def add_new_col(data, col_name, my_list):
    #function looks at the column col_name in a dataframe data, if the substring exists, it adds a new
    #column with only that substring, keeping multiples
    
    for i in my_list:
        if data[col_name].str.contains(i):
            data['col2'] = i
        else:
            continue
    return data

running the function in the notebook

my_list = ['a', 'b', 'c', 'd']
add_new_col(df, 'col1', my_list)

returns this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

based on some other answers I also tried this

def add_new_col(data, col_name, my_list):
    #function looks at the column col_name in a dataframe data, if the substring exists, it adds a new
    #column with only that substring, keeping multiples
    
    for i in my_list:
        if data[data[col_name].str.contains(i)]:
            data['col2'] = i
        else:
            continue
    return data

which gave the same error code

robert_553z8
  • 172
  • 10
  • Don't understand the logic for col2. For instance, rows 0 & 1 in col1 are both 'acd', but in col2 we get 'a' for row 0 and 'c' for row 1. – DarrylG Dec 07 '20 at 17:18

2 Answers2

1

You can use the explode method:

df2 = df.assign(col2=lambda f: f.col1.apply(list)).explode("col2")
print(df2)
  col1 col2
0  acd    a
0  acd    c
0  acd    d
1    a    a
2   db    d
2   db    b

If you want to get rid of the index just add: df2 = df2.reset_index(drop=True)

Horace
  • 1,024
  • 7
  • 12
  • the actual column 1 will contain up to 500 characters, this was an oversimplification, column 2 is a list of 60 items, all strings up to 80 characters long – robert_553z8 Dec 07 '20 at 17:32
0

You can use str.split and concat:

import numpy as np
s = df['col1'].str.split('',expand=True).replace('',np.nan).stack()\
              .reset_index(1,drop=True).to_frame('col2')

df1 = pd.concat([df,s],1)

  col1 col2
0  acd    a
0  acd    c
0  acd    d
1    a    a
2   db    d
2   db    b
halfer
  • 19,824
  • 17
  • 99
  • 186
Umar.H
  • 22,559
  • 7
  • 39
  • 74