I am trying my hand at Question Answering and have to make my own dataset. I have 5 columns:

question | context | answer | answer_start | answer_end

Each record in the context column has a chunk of text, e.g.,

Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.

The corresponding answer contains a string of text extracted from the context, e.g.,

the first person to walk on the Moon

I need to populate answer_start and answer_end, which are the starting/ending indexes of the answer text within context. In the above example, answer_start would be 114 & answer_end would be 150. They are currently empty columns.

I tried the following:

df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())

But it threw an error:

TypeError: 'int' object is not subscriptable

Is there a way to fix what I have? Is there a way to do this that doesn't require a loop?

aapal

4 Answers

You should use df.apply(..., axis=1) instead of df['answer_start'].apply. When you call apply on a Series, x is not a row but a single value from that column (here an integer), so x['answer'] raises the TypeError.
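As a rough sketch of that fix, keeping the re.search approach from the question but applying it row-wise: the one-row DataFrame and the find_span helper are made up for illustration, and re.escape is my own addition, on the assumption that the answer should be matched as literal text rather than as a regex pattern. Rows with no match get -1 here, which is just one possible convention.

```python
import re
import pandas as pd

# Hypothetical one-row dataset, shortened from the example in the question.
df = pd.DataFrame({
    "context": ["Neil Alden Armstrong was the first person to walk on the Moon."],
    "answer": ["the first person to walk on the Moon"],
})

def find_span(row):
    # re.escape treats the answer as literal text (assumption: answers are not regexes).
    match = re.search(re.escape(row["answer"]), row["context"])
    if match is None:
        # Assumed convention: -1 marks an answer that does not occur in the context.
        return pd.Series({"answer_start": -1, "answer_end": -1})
    return pd.Series({"answer_start": match.start(), "answer_end": match.end()})

# axis=1 passes each row to find_span, so row["answer"] and row["context"] both work.
spans = df.apply(find_span, axis=1)
df["answer_start"] = spans["answer_start"]
df["answer_end"] = spans["answer_end"]
```

The indexes are 0-based and answer_end is exclusive, so context[answer_start:answer_end] gives back the answer.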

alparslan mimaroğlu

Try:

import pandas as pd

# Small example frame with the two columns needed for the lookup
df = pd.DataFrame({'context': ['The cat sat on the mat', 'Around the world we go'],
                   'answer': ['mat', 'world']})


df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer'].str.len() + df['answer_start']

print(df)

                  context answer  answer_start  answer_end
0  The cat sat on the mat    mat            19          22
1  Around the world we go  world            11          16
MDR

Try:

df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer_start'] + df['answer'].str.len()
>>> df[['answer_start', 'answer_end']]
   answer_start  answer_end
0           113         149
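One thing to watch with str.find: it returns -1 when the answer is not present in the context, and the answer_end computed from that would be misleading. A minimal sketch of flagging those rows, using a made-up two-row frame where the second answer is deliberately absent, and assuming -1 is an acceptable marker for both columns:

```python
import pandas as pd

# Hypothetical data: "moon" does not appear in the second context.
df = pd.DataFrame({
    "context": ["The cat sat on the mat", "Around the world we go"],
    "answer": ["mat", "moon"],
})

df["answer_start"] = df.apply(lambda x: x["context"].find(x["answer"]), axis=1)
df["answer_end"] = df["answer_start"] + df["answer"].str.len()

# str.find returned -1 for rows where the answer was not found;
# overwrite both columns there so answer_end is not a bogus positive number.
missing = df["answer_start"] == -1
df.loc[missing, ["answer_start", "answer_end"]] = -1
```

Rows flagged this way can then be inspected or dropped before training.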
Corralien

You can use the context and answer fields to calculate the indices. To use multiple fields in one function, call df.apply with axis=1 so that each row is passed in. For example, I have created a toy dataset:

import pandas as pd

text = "this is text number"

df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

The data looks as follows:

                       A       B
0  this is text number 1    this
1  this is text number 2      is
2  this is text number 3    text
3  this is text number 4  number

Now we can calculate the start and end index values:

df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df.apply(lambda row: row["start"] + len(row["B"]), axis=1)

And this is the result:

                       A       B  start  end
0  this is text number 1    this      0    4
1  this is text number 2      is      2    4
2  this is text number 3    text      8   12
3  this is text number 4  number     13   19
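A quick way to sanity-check the result is to slice each string with the computed indexes and compare against the search text. This is only a sketch on the same toy data; the recovered name is mine. Note that find returns the first occurrence, which is why "is" is located at index 2, inside "this", rather than at the standalone word.

```python
import pandas as pd

text = "this is text number"
df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df["start"] + df["B"].str.len()

# Slice A with the computed indexes; "end" is exclusive, as in normal Python slicing.
recovered = df.apply(lambda row: row["A"][row["start"]:row["end"]], axis=1)

# Every slice should reproduce the text that was searched for.
assert (recovered == df["B"]).all()
```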
ronpi