
Here's my data

Id  Keyword
1   ayam e-commerce
2   biaya fuel personal wallet
3   pulsa sms virtualaccount
4   biaya koperasi personal
5   familymart personal
6   e-commerce pln
7   biaya onus
8   koperasi personal
9   biaya familymart personal
10  fuel personal wallet
11  fuel travel

I want every entry that contains one of the keywords fuel, pln, or ayam to be shortened to just that keyword, so the output looks like this:

Id  Keyword
1   ayam
2   biaya fuel personal wallet
3   pulsa sms virtualaccount
4   biaya koperasi personal
5   familymart personal
6   pln
7   biaya onus
8   koperasi personal
9   biaya familymart personal
10  fuel
11  fuel

How should I do this?

smci
Nabih Bawazir

1 Answer


To replace each entry with only the first matched keyword, use str.contains in a loop:

L = ['fuel', 'pln', 'ayam']
for x in L:
    df.loc[df['Keyword'].str.contains(x), 'Keyword'] = x

Or use a nested list comprehension:

L = ['fuel', 'pln', 'ayam']
df['Keyword'] = [next(iter([z for z in L if z in x]), x) for x in df['Keyword']]

Or use str.extract with fillna, which restores the original value wherever nothing matched:

L = ['fuel', 'pln', 'ayam']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['Keyword'] = df['Keyword'].str.extract('('+ pat + ')', expand=False).fillna(df['Keyword'])


print (df)
    Id                    Keyword
0    1                       ayam
1    2                       fuel
2    3   pulsa sms virtualaccount
3    4    biaya koperasi personal
4    5        familymart personal
5    6                        pln
6    7                 biaya onus
7    8          koperasi personal
8    9  biaya familymart personal
9   10                       fuel
10  11                       fuel
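
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the extract approach, with the DataFrame rebuilt from the question's data. The `re.escape` call and the `result` name are my additions, not part of the snippets above; escaping only matters if a keyword contains regex metacharacters.

```python
import re
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Id": range(1, 12),
    "Keyword": [
        "ayam e-commerce",
        "biaya fuel personal wallet",
        "pulsa sms virtualaccount",
        "biaya koperasi personal",
        "familymart personal",
        "e-commerce pln",
        "biaya onus",
        "koperasi personal",
        "biaya familymart personal",
        "fuel personal wallet",
        "fuel travel",
    ],
})

L = ["fuel", "pln", "ayam"]
# re.escape is an assumption added here to guard against regex metacharacters
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in L)

# extract returns NaN where no keyword matches; fillna restores the original text
shortened = df["Keyword"].str.extract("(" + pat + ")", expand=False).fillna(df["Keyword"])
result = df.assign(Keyword=shortened)
print(result)
```

Using `assign` leaves `df` untouched, which makes it easy to compare the shortened column against the original.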

If you need all matched values, use str.findall with str.join, then use loc to overwrite only the rows where the result is non-empty:

print (df)
   Id                   Keyword
0   1           ayam e-commerce
1   2     biaya fuel pln wallet <- matched 2 keywords
2   3  pulsa sms virtualaccount

pat = '|'.join(r"\b{}\b".format(x) for x in L)
s = df['Keyword'].str.findall('('+ pat + ')').str.join(', ')
df.loc[s != '', 'Keyword'] = s
print (df)
   Id                   Keyword
0   1                      ayam
1   2                 fuel, pln
2   3  pulsa sms virtualaccount
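
The same idea as a runnable sketch on the three-row sample. One deliberate difference from the snippet above: the pattern here uses a single non-capturing group, `\b(?:...)\b`, instead of wrapping the alternation in a capturing group. Both should give the same matches for these keywords.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Keyword": ["ayam e-commerce", "biaya fuel pln wallet", "pulsa sms virtualaccount"],
})

L = ["fuel", "pln", "ayam"]
# One non-capturing group with word boundaries around the whole alternation
pat = r"\b(?:" + "|".join(L) + r")\b"

# findall gives a list of matches per row; join collapses each list to one string
s = df["Keyword"].str.findall(pat).str.join(", ")

# Rows with no match produce an empty string, so they keep their original text
df.loc[s != "", "Keyword"] = s
print(df)
```

Matches are joined in the order they appear in the text, not in the order of `L`.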
jezrael
  • You might as well add example rows containing multiple values from 'fuel', 'pln', and 'ayam' to illustrate the difference in output, otherwise the results from your code will look identical. – smci Feb 25 '19 at 08:11
  • ...oh and possibly in different orders, if that makes a difference for some code... – smci Feb 25 '19 at 08:14
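
To illustrate the point raised in these comments, here is a small sketch comparing the three single-keyword approaches on rows that contain more than one keyword, in different orders. The two sample rows and the `by_*` names are hypothetical, made up for this comparison.

```python
import pandas as pd

L = ["fuel", "pln", "ayam"]
# Hypothetical rows, each containing two keywords
keywords = pd.Series(["pln fuel wallet", "ayam pln"])

# Loop with str.contains: the keyword that comes first in L wins,
# because once a row is overwritten it no longer contains the others
by_loop = keywords.copy()
for x in L:
    by_loop.loc[by_loop.str.contains(x)] = x

# List comprehension: also picks the keyword that comes first in L
by_listcomp = [next(iter([z for z in L if z in x]), x) for x in keywords]

# str.extract: picks the keyword that appears first in the text
pat = "|".join(r"\b{}\b".format(x) for x in L)
by_extract = keywords.str.extract("(" + pat + ")", expand=False).fillna(keywords)

print(by_loop.tolist())     # ['fuel', 'pln']
print(by_listcomp)          # ['fuel', 'pln']
print(by_extract.tolist())  # ['pln', 'ayam']
```

So the loop and the list comprehension follow the order of `L`, while extract follows the order of words in each row. On the question's data every row has at most one keyword, which is why all three give identical output there.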