Python: Find starting, ending index of sub-text column from another text column

Question

I am trying my hand at Question Answering and have to make my own dataset. I have 5 columns:

question | context | answer | answer_start | answer_end

Each record in the context column has a chunk of text, e.g.,

Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer, and the first person to walk on the Moon. He was also a naval aviator, test pilot, and university professor.

The corresponding answer contains a string of text extracted from the context, e.g.,

the first person to walk on the Moon

I need to populate answer_start and answer_end, which are the starting/ending indexes of the answer text within context. In the above example, answer_start would be 114 & answer_end would be 150. They are currently empty columns.

I tried the following:

df['answer_start'].apply(lambda x: re.search(x['answer'], x['context']).start())

But it threw an error:

TypeError: 'int' object is not subscriptable

Is there a way to fix what I have? Is there a way to do this that doesn't require a loop?

score 0 · Answer 1 · answered Aug 19 '21 at 21:36


You should use df.apply instead of df['answer_start'].apply. Because you are calling apply on a Series, each x is a single value from that column (an integer here), not a row, so x['answer'] raises the TypeError.

answered Aug 19 '21 at 21:36 · alparslan mimaroğlu

Thanks. I tried that but then got "KeyError: variableName" – aapal Aug 19 '21 at 21:39
you should also change the axis to 1 like `df.apply(your_function, axis=1)` – alparslan mimaroğlu Aug 19 '21 at 21:56
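
Putting the answer and the comment together, a minimal sketch of the original re.search approach might look like this. It assumes the answers are literal strings rather than regex patterns (hence re.escape), uses a small made-up frame, and find_start is a hypothetical helper name, not something from the question.

```python
import re
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "context": ["The cat sat on the mat", "Around the world we go"],
    "answer": ["mat", "world"],
})

def find_start(row):
    # re.escape treats the answer as literal text, so characters such as
    # "(" or "." in the answer are not interpreted as regex syntax.
    match = re.search(re.escape(row["answer"]), row["context"])
    # re.search returns None when there is no match; -1 mirrors str.find.
    return match.start() if match else -1

# axis=1 passes each row (as a Series) to the function instead of each column.
df["answer_start"] = df.apply(find_start, axis=1)
print(df)
```

Without axis=1, apply hands the function one column at a time, so row["answer"] is looked up in the column's index, which is a likely cause of the KeyError mentioned above.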

MDR · Answer 2 · 2021-08-19T22:01:02.590


Try:

import pandas as pd

df = pd.DataFrame({'context':
                   ['The cat sat on the mat', 'Around the world we go'],
                   'answer': ['mat', 'world']})


df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer'].str.len() + df['answer_start']

print(df)

                  context answer  answer_start  answer_end
0  The cat sat on the mat    mat            19          22
1  Around the world we go  world            11          16

answered Aug 19 '21 at 21:56 · edited Aug 19 '21 at 22:01 · MDR

`len(df['answer'])` returns the number of rows not the length of the string in the cell – Corralien Aug 19 '21 at 21:58
Doh, fixed. Thanks. – MDR Aug 19 '21 at 22:02

Corralien · Accepted Answer · 2021-08-20T04:38:46.863


Try:

df['answer_start'] = df.apply(lambda x: x['context'].find(x['answer']), axis=1)
df['answer_end'] = df['answer_start'] + df['answer'].str.len()

>>> df[['answer_start', 'answer_end']]
   answer_start  answer_end
0           113         149
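
One caveat: str.find returns -1 when the answer does not occur in the context, and the addition then produces a meaningless answer_end. The sketch below shows one way to deal with that, assuming you simply want to drop such rows. The second row and the name clean are made up for illustration.

```python
import pandas as pd

context = ("Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an "
           "American astronaut and aeronautical engineer, and the first person "
           "to walk on the Moon.")

# The second answer is deliberately absent from the context.
df = pd.DataFrame({
    "context": [context, context],
    "answer": ["the first person to walk on the Moon", "the second person"],
})

df["answer_start"] = df.apply(lambda x: x["context"].find(x["answer"]), axis=1)
df["answer_end"] = df["answer_start"] + df["answer"].str.len()

# str.find signals "not found" with -1, so keep only rows with a real match.
clean = df[df["answer_start"] >= 0]
print(clean[["answer_start", "answer_end"]])
```

The indexes are 0-based and answer_end is exclusive, so context[answer_start:answer_end] gives back the answer text.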

answered Aug 19 '21 at 21:57 · edited Aug 20 '21 at 04:38 · Corralien

score 0 · Answer 4 · answered Aug 19 '21 at 22:04

You can use the context and answer fields to calculate the indices. To use multiple fields in one computation, use df.apply with axis=1. For example, I have created a toy dataset:

import pandas as pd

text = "this is text number"

df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

The data looks as follows:

                       A       B
0  this is text number 1    this
1  this is text number 2      is
2  this is text number 3    text
3  this is text number 4  number

Now we can calculate the start and end index values:

df["start"] = df.apply(lambda row: row["A"].find(row["B"]), axis=1)
df["end"] = df.apply(lambda row: row["start"] + len(row["B"]), axis=1)

And this is the result:

                       A       B  start  end
0  this is text number 1    this      0    4
1  this is text number 2      is      2    4
2  this is text number 3    text      8   12
3  this is text number 4  number     13   19
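
Note that row 1 reports start 2 because find returns the first occurrence, and "is" first appears inside "this". As a rough alternative to apply, the same columns can be built with a list comprehension over zip, followed by a check that slicing A with the computed indexes gives back B. This is only a sketch on the same toy data, and recovered is a name introduced here.

```python
import pandas as pd

text = "this is text number"
df = pd.DataFrame({"A": [f"{text} {i+1}" for i in range(4)], "B": text.split(" ")})

# zip walks both columns in lockstep without building a Series per row.
df["start"] = [a.find(b) for a, b in zip(df["A"], df["B"])]
df["end"] = df["start"] + df["B"].str.len()

# Sanity check: slicing A with the computed indexes should return B.
recovered = [a[s:e] for a, s, e in zip(df["A"], df["start"], df["end"])]
assert recovered == df["B"].tolist()
print(df)
```

This is still a loop under the hood, as apply is, but it avoids constructing a row object for every record.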
