I have a processed dataframe that is used as input to train an NLP model:

   sentence_id words labels
0            0     a  B-ORG
1            0     b  I-ORG
2            0     c  I-ORG
5            1     d  B-ORG
6            1     e  I-ORG
7            2     f  B-PER
8            2     g  I-PER

I need to convert this into CoNLL text format, as shown below:

a B-ORG
b I-ORG
c I-ORG

d B-ORG
e I-ORG

f B-PER
g I-PER

The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first token on each line should be the word and the last token should be the label.

Does anyone have an idea how to do that?

Shyam

1 Answer

First join both columns with a space, then loop over DataFrame.groupby, add an empty value at the end of each group, and write each group to the file:

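import numpy as np
import pandas as pd

df['join'] = df['words'] + ' ' + df['labels']
#alternative
#df['join'] = df['words'].str.cat(df['labels'], sep=' ')
for i, g in df.groupby('sentence_id')['join']:
    # Series.append was removed in pandas 2.0; pd.concat adds the trailing empty row
    out = pd.concat([g, pd.Series({'new': np.nan})])
    # mode='a' appends each sentence group to the same file
    out.to_csv('file.txt', index=False, header=False, mode='a')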
jezrael
  • It has been converted, but the issue is that it adds an extra space to the blank line, which gives me an error while training: `out = g.append(pd.Series({'new':' '}))` – Shyam Feb 25 '21 at 14:17
  • @Shyam - Can you try changing `'new':' '` to `'new':''`? – jezrael Feb 25 '21 at 14:18
  • It's also giving this output: ```den B-ORG channel I-ORG iii I-ORG cable I-ORG networks I-ORG private I-ORG limited I-ORG "" bharat B-ORG tea I-ORG trading I-ORG company I-ORG limited I-ORG``` – Shyam Feb 25 '21 at 14:26
  • @Shyam - Understood, I hope `{'new':np.nan}` works well – jezrael Feb 25 '21 at 14:27
  • Nope, giving the same output – Shyam Feb 25 '21 at 14:29
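
The `""` on the separator line in the output quoted above looks like a side effect of writing through a CSV writer, which quotes a row that consists of a single empty field. One way around it is to skip `to_csv` for the final step and build the text directly. The following is only a minimal sketch, assuming the same `sentence_id`, `words` and `labels` columns as in the question; the function name `to_conll` is made up for illustration and is not part of pandas:

```python
import pandas as pd


def to_conll(df: pd.DataFrame) -> str:
    """Build CoNLL-style text: one "word label" pair per line,
    with an empty line between sentences."""
    blocks = []
    # sort=False keeps sentences in the order they first appear
    for _, group in df.groupby("sentence_id", sort=False):
        lines = [f"{word} {label}"
                 for word, label in zip(group["words"], group["labels"])]
        blocks.append("\n".join(lines))
    # "\n\n" ends one sentence and leaves a truly empty line before the next
    return "\n\n".join(blocks) + "\n"


# Sample data copied from the question
df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 2, 2],
    "words": list("abcdefg"),
    "labels": ["B-ORG", "I-ORG", "I-ORG", "B-ORG", "I-ORG", "B-PER", "I-PER"],
})

conll_text = to_conll(df)
print(conll_text, end="")
```

The resulting string can be saved with an ordinary `open(path, "w")` and `write(conll_text)`, so no CSV quoting is involved. If the training tool expects an empty line after the last sentence as well, the final `"\n"` can be changed to `"\n\n"`.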