I currently have a dataframe (the example was originally posted as an image).

I also have a list of keywords and sentences. I want to match them against the column 'Content' and see whether any of the keywords or sentences occur in it.

Here is what I've done:

# instructions_list is just the list of keywords and key sentences 
instructions_list = instructions['Key words & sentence search'].tolist()
pattern = '|'.join(instructions_list)


bureau_de_sante[bureau_de_sante['Content'].str.contains(pattern, regex = True)]

While this gives me the results, it also raises this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

Questions:

  1. How can I prevent the UserWarning from showing up?
  2. After checking whether the column contains a match, how can I put the specific match in a new column?

1 Answer

You are passing a regex to str.contains. If any entry in your instruction list contains parentheses (as is the case in your example), the parentheses form a match group, which is what triggers the warning. To avoid this, escape them by adding a \ in front, so that (Critical risk) becomes \(Critical risk\). You will probably also want to escape the other regex metacharacters, such as \ . * + ? | [ ] { } ^ $.
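As a rough sketch of the escaping step, the standard library's re.escape can be applied to every entry before joining them. The keyword list and the dataframe below are made-up stand-ins for the question's instructions_list and bureau_de_sante, so treat the names and data as assumptions:

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's keyword list and dataframe.
keywords = ["(Critical risk)", "wash hands", "cost is $5.00"]
df = pd.DataFrame(
    {
        "Content": [
            "Status: (Critical risk) found",
            "Please wash hands often",
            "Nothing relevant here",
        ]
    }
)

# re.escape backslash-escapes regex metacharacters, so every keyword is
# matched literally and the joined pattern contains no match groups.
pattern = "|".join(re.escape(k) for k in keywords)

# Boolean mask: True where any keyword occurs in 'Content'.
mask = df["Content"].str.contains(pattern, regex=True)
matches = df[mask]
```

Because the escaped pattern has no capturing groups, str.contains has nothing to warn about.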

Now, you can use these groups to extract the match from your data. Here is an example:

import pandas as pd

df = pd.DataFrame(["Hello World", "Foo Bar Baz", "Goodbye"], columns=["text"])
pattern = "(World|Bar)"
# .str is a Series accessor, so select the column first
print(df["text"].str.extract(pattern))
#        0
# 0  World
# 1    Bar
# 2    NaN

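You can add this column to your dataframe with a simple assignment (e.g. df["result"] = df["text"].str.extract(pattern, expand=False), where expand=False makes extract return a Series instead of a one-column dataframe).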
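Applied to a keyword list like the one in the question, one possible approach is to escape each keyword and wrap the whole alternation in a single capturing group, so that str.extract returns the keyword that matched. This is only a sketch: the data and the "match" column name are invented for illustration.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's keyword list and dataframe.
keywords = ["(Critical risk)", "wash hands"]
df = pd.DataFrame(
    {
        "Content": [
            "Status: (Critical risk) found",
            "Please wash hands often",
            "Nothing relevant here",
        ]
    }
)

# Escape each keyword, then wrap the alternation in ONE capturing group.
escaped = "|".join(re.escape(k) for k in keywords)
pattern = f"({escaped})"

# expand=False returns a Series; rows without a match get a missing value.
df["match"] = df["Content"].str.extract(pattern, expand=False)

# The same column also answers "is there a match at all?".
has_match = df["match"].notna()
```

Note that str.extract only returns the first match in each row; str.findall or str.extractall would be needed to collect several matches per row.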

qmeeus
  • is there some built-in function that allows escaping all special characters? – angleofaxis Oct 01 '20 at 16:19
  • https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes – qmeeus Oct 01 '20 at 16:23
  • Also, when I try to use the .str.extract(pattern) that you mentioned on the data I am working on, I'm only getting NaNs. Do you know what is causing that to show rather than the matches? – angleofaxis Oct 01 '20 at 16:25
  • Yes, that means there are no matches. Chances are that your regex is a bit all over the place, since you are joining complete sentences together. Is there any way to simplify it given your use case? If not, you might be better off using a for loop over your instruction list – qmeeus Oct 01 '20 at 16:32
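The loop over the instruction list suggested in the last comment could look roughly like the sketch below. It uses plain substring checks instead of a regex, so no escaping is needed. The helper first_keyword, the "match" column, and the sample data are all hypothetical:

```python
import pandas as pd

# Hypothetical stand-ins for the question's keyword list and dataframe.
keywords = ["(Critical risk)", "wash hands"]
df = pd.DataFrame(
    {
        "Content": [
            "Status: (Critical risk) found",
            "Please wash hands often",
            "Nothing relevant here",
        ]
    }
)


def first_keyword(text, candidates):
    """Return the first candidate that occurs in text, or None if none does."""
    for candidate in candidates:
        if candidate in text:  # literal, case-sensitive substring test
            return candidate
    return None


# One pass per row; the result holds the matched keyword or a missing value.
df["match"] = df["Content"].apply(lambda text: first_keyword(text, keywords))
```

This is slower than a single vectorised regex on large frames, but it is easy to debug when the keywords are whole sentences.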