I have labeled data like this:

    import pandas as pd

    Data = {'text': ['when can I decrease the contribution to my health savings?', 'I love my guinea pig', 'I love my dog'],
            'start': [43, 10, 10],
            'end': [57, 19, 12],
            'entity': ['hsa', 'pet', 'pet'],
            'value': ['health savings', 'guinea pig', 'dog']
           }
    df = pd.DataFrame(Data)

       text               start  end         entity     value
0   .. health savings      43    57          hsa      health savings
1   I love my guinea pig   10    19          pet      guinea pig
2   I love my dog          10    12          pet         dog

I want to split each sentence into words and tag each word. If a word belongs to an entity, it should be tagged with that entity.

I have tried the approach from this question: Split sentences in pandas into sentence number and words

That method works when the value is a single word like 'dog', but not when the value is a phrase like 'guinea pig'.

I want to perform BIO tagging: B stands for the beginning of a phrase, I for inside a phrase, and O for outside.

Thus the desired output will be:

    Sentence #  Word         Entity
0   Sentence: 0 when            O
1   Sentence: 0 can             O
2   Sentence: 0 I               O
3   Sentence: 0 decrease        O
4   Sentence: 0 the             O
5   Sentence: 0 contribution    O
6   Sentence: 0 to              O
7   Sentence: 0 my              O
8   Sentence: 0 health          B-hsa
9   Sentence: 0 savings?        I-hsa
10  Sentence: 1 I               O
11  Sentence: 1 love            O
12  Sentence: 1 my              O
13  Sentence: 1 guinea          B-pet
14  Sentence: 1 pig             I-pet
15  Sentence: 2 I               O
16  Sentence: 2 love            O
17  Sentence: 2 my              O
18  Sentence: 2 dog             B-pet
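
For reference, the B/I/O rule described above can be sketched in plain Python on a single sentence. The helper name `bio_tags` is made up for illustration, and the sketch assumes whitespace tokenization and that the phrase appears as an exact run of tokens (so trailing punctuation such as `savings?` would need stripping first).

```python
def bio_tags(tokens, phrase_tokens, label):
    """Tag the first exact occurrence of phrase_tokens in tokens (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    n = len(phrase_tokens)
    if n == 0:
        return tags
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            tags[i] = "B-" + label           # first word of the phrase
            for j in range(i + 1, i + n):
                tags[j] = "I-" + label       # remaining words of the phrase
            break
    return tags


tokens = "I love my guinea pig".split()
tags = bio_tags(tokens, "guinea pig".split(), "pet")
print(list(zip(tokens, tags)))
```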
Dylan

2 Answers

Use:

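import numpy as np
import pandas as pd

# One row per whitespace-separated word; 'value' and 'entity' ride along in the index
df1 = (df.set_index(['value','entity'], append=True)
         .text.str.split(expand=True)
         .stack()
         .dropna()  # drops padding from shorter sentences; a no-op where stack already drops it
         .reset_index(level=3, drop=True)
         .reset_index(name='Word')
         .rename(columns={'level_0':'Sentence'}))

df1['Sentence'] = 'Sentence: ' + df1['Sentence'].astype(str)
# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default
w = df1['Word'].str.replace(r'[^\w\s]+', '', regex=True)
splitted = df1.pop('value').str.split()
e = df1.pop('entity')

# m1: word is the first word of the phrase; m2: word occurs anywhere in the phrase
m1 = splitted.str[0].eq(w)
m2 = pd.Series([b in a for a, b in zip(splitted, w)], index=df1.index)

df1['Entity'] = np.select([m1, m2 & ~m1], ['B-' + e, 'I-' + e], default='O')

print(df1)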

       Sentence          Word Entity
0   Sentence: 0          when      O
1   Sentence: 0           can      O
2   Sentence: 0             I      O
3   Sentence: 0      decrease      O
4   Sentence: 0           the      O
5   Sentence: 0  contribution      O
6   Sentence: 0            to      O
7   Sentence: 0            my      O
8   Sentence: 0        health  B-hsa
9   Sentence: 0      savings?  I-hsa
10  Sentence: 1             I      O
11  Sentence: 1          love      O
12  Sentence: 1            my      O
13  Sentence: 1        guinea  B-pet
14  Sentence: 1           pig  I-pet
15  Sentence: 2             I      O
16  Sentence: 2          love      O
17  Sentence: 2            my      O
18  Sentence: 2           dog  B-pet

Explanation:

  1. First create a new DataFrame with DataFrame.set_index, Series.str.split and DataFrame.stack
  2. Clean up the index with DataFrame.reset_index and DataFrame.rename
  3. Prepend a string to the Sentence column
  4. Use Series.str.replace to remove punctuation
  5. Use DataFrame.pop to extract the columns, and split to turn each value into a list of words
  6. Create mask m1 by comparing each word with the first word of the split list
  7. Create mask m2 by testing whether each word appears anywhere in the split list
  8. Create the new column with numpy.select
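
One caveat: the masks in steps 6 and 7 match by word, so a word from the phrase that also appears elsewhere in the sentence would be tagged too. If that matters, a possible alternative is to use the character offsets from the question instead. The following is only a sketch under some assumptions: words are split on whitespace, the span end is derived as `start + len(value)` rather than read from the `end` column (whose sample values mix inclusive and exclusive ends), and sentences are numbered by position.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "text": ["when can I decrease the contribution to my health savings?",
             "I love my guinea pig", "I love my dog"],
    "start": [43, 10, 10],
    "entity": ["hsa", "pet", "pet"],
    "value": ["health savings", "guinea pig", "dog"],
})

rows = []
for sent_id, row in enumerate(df.itertuples(index=False)):
    span_start = row.start
    span_end = row.start + len(row.value)   # exclusive end, derived from the value
    inside = False
    for match in re.finditer(r"\S+", row.text):
        # A word belongs to the entity if its first character lies in the span
        if span_start <= match.start() < span_end:
            tag = ("I-" if inside else "B-") + row.entity
            inside = True
        else:
            tag = "O"
        rows.append((f"Sentence: {sent_id}", match.group(), tag))

tagged = pd.DataFrame(rows, columns=["Sentence", "Word", "Entity"])
print(tagged)
```

Because the tag depends only on where each word starts, punctuation attached to a word (`savings?`) does not need to be stripped first.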
jezrael

Step 1: Split the value column on whitespace with the code below:

s = df['value'].str.split(expand=True).stack().dropna()  # one word per row
s.index = s.index.droplevel(-1)  # to line up with df's index
s.name = 'value'  # needs a name to join
del df['value']
df1 = df.join(s)
df1 = df1.reset_index()

This step breaks each phrase into single words, one per row.
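
If DataFrame.explode is available (pandas 0.25 or later), the same reshaping can probably be written more compactly. A minimal sketch on a cut-down copy of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["I love my guinea pig", "I love my dog"],
    "entity": ["pet", "pet"],
    "value": ["guinea pig", "dog"],
})

# Turn each phrase into a list of words, then give every word its own row.
# reset_index keeps the original row number in a column named 'index'.
df1 = df.assign(value=df["value"].str.split()).explode("value").reset_index()
print(df1)
```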

Step 2: df1 now has a new value column with one word per row, so all that remains is to update the entity column to match it:

prev_id = None
for idx, ser in df1.iterrows():
    # Same text as the previous row means we are still inside the phrase
    prefix = 'I-' if ser['text'] == prev_id else 'B-'
    df1.loc[idx, 'entity'] = prefix + ser['entity']
    prev_id = ser['text']

The code above rewrites the entity column: the first word of each phrase gets a B- prefix, and each following row with the same text gets an I- prefix.
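
The loop can likely be replaced with a vectorized version. The sketch below assumes df1 still has the 'index' column produced by reset_index in step 1, and uses it (rather than the text) to detect the first word of each phrase, so two different rows with identical text are not merged:

```python
import pandas as pd

# Stand-in for df1 after step 1 (only the columns needed here)
df1 = pd.DataFrame({
    "index": [0, 0, 1],
    "entity": ["pet", "pet", "pet"],
    "value": ["guinea", "pig", "dog"],
})

# The first word of each original row gets "B-", every later word gets "I-"
is_first = ~df1["index"].duplicated()
prefix = is_first.map({True: "B-", False: "I-"})
df1["entity"] = prefix + df1["entity"]
print(df1)
```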

Step 3: After this, your dataframe has the same shape as the one in the question you linked, so you can simply apply the same solution.

This takes care of the multi-word phrase problem you described.

Rahul Agarwal