0

My dataframe is currently created to be wide with many columns, after the for statement below is executed. I want to stack multiple columns of data so that the dataframe is long and remove any blank rows from col4 before the output dataframe is generated. The reason for the latter part (remove blanks before output is generated) is because the dataframe will be too large for any output to be created with the blank values included.

code:

# dataframe
df0 = pd.DataFrame(data ={'col1':[123,123,456,456],'col2':['one two three',
 'green yellow','four five six','green yellow']})

# words to search for
search_words1 = ['one','three','four','six','green yellow']

# create columns for each search word and indicate if search word is found for each row
for n in search_words1:
        df0[n] = np.where(df0['col2'].str.contains(n),n,'')

# stack all search word columns created and remove blank rows in col4 before output is generated
df0 = pd.concat([
    df0[['col1']].melt(value_name='col3'), 
    df0[['one','three','four','six','green yellow']].melt(value_name='col4')], 
    axis=1)

df0.loc[:,['col3','col4']]

current output:

    col3    col4
0   123.0   one
1   123.0   
2   456.0   
3   456.0   
4   NaN     three
5   NaN 
6   NaN 
7   NaN 
8   NaN 
9   NaN 
10  NaN     four
11  NaN 
12  NaN 
13  NaN 
14  NaN     six
15  NaN 
16  NaN 
17  NaN     green yellow
18  NaN 
19  NaN     green yellow

desired output:

      col3  col4
0   123.0   one
1   123.0   three
2   123.0   green yellow
3   456.0   four
4   456.0   six
5   456.0   green yellow     
hector.h2913
  • 41
  • 2
  • 8
  • What does the original data look like? – ifly6 Feb 28 '22 at 02:44
  • @ifly6 the original data is an ID column (number converted to str) and a text column with lots of text. the above test `dataframe` is similar to the original data but not as large. I need help! The two solutions so far do not help. – hector.h2913 Feb 28 '22 at 22:20

2 Answers2

2

try this:

search_words1 = ['one','three','four','six','green yellow']
search_words1 = '|'.join(search_words1)
df0['col2'] = df0.col2.str.findall(search_words1)
df0.explode('col2')
>>>
    col1    col2
0   123     one
0   123     three
1   123     green yellow
2   456     four
2   456     six
3   456     green yellow

df0['col2'] = df0.col2.str.findall(search_words1)

In this step, you will get the following result:

    col1    col2
0   123     [one, three]
1   123     [green yellow]
2   456     [four, six]
3   456     [green yellow]

The last step, explode 'col2'

df0 = df0.explode('col2')
print(df0)
ziying35
  • 1,190
  • 3
  • 6
1

you can remove all nan and blank spaces like such.

col3 = df0['col3']
col4 = df0['col4']
three = col3[col3.notna()]
four = col4[col4 != ""]
print(three, '\n', four)

out:

0    123.0
1    123.0
2    456.0
3    456.0
Name: col3, dtype: float64 
 0              one
4            three
10            four
14             six
17    green yellow
19    green yellow
Name: col4, dtype: object
Jason Leaver
  • 286
  • 2
  • 11