0

I have a pandas DataFrame called df with a column called article. The article column contains 600 strings, each of which represents a news article. I want to KEEP only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I haven't been able to find a way to do this on my own.

(In each string, sentences are separated by \n. An example article looks like this:)

\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
David Buck
Yue Peng
  • Do you mean you want to drop all rows that do not contain these words in that column? I assume from [this question](https://stackoverflow.com/q/61916583/7508700) that you will first be reducing the article column down to just the first three or four sentences prior to filtering? – David Buck May 20 '20 at 19:46
  • Yes, I want to drop all the rows that do not contain those words in that column, but I do not want to reduce the article column down to just the first three or four sentences. I hope to keep the full articles after filtering. :) – Yue Peng May 21 '20 at 14:54

4 Answers

1

First we define a function to return a boolean based on whether your keywords appear in a given sentence:

def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)

Then we create a boolean series by applying this function (using Series.apply) to the sentences of your df.article column.

Note that we use a lambda function to truncate the string passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. the end of your fourth sentence (more info on how this works here):

series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))

Then we pass the boolean series to df.loc to select the rows where the series evaluates to True:

filtered_df = df.loc[series]
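
One caveat with the slicing trick above: `str.find` returns -1 when there is no fifth `'\n'`, so for an article with four or fewer sentences the slice becomes `s[:-1]` and silently drops the last character. A minimal sketch of a variant that splits instead of searching, assuming (as in the example article) that an article may start with a leading `'\n'`; the helper name `first_sentences` and the sample data are made up for illustration:

```python
import pandas as pd

def first_sentences(text, n=4):
    """Return the first n newline-separated sentences of text."""
    # strip() drops the leading '\n' so it does not count as an empty sentence;
    # list slicing is safe even when there are fewer than n sentences
    return "\n".join(text.strip().split("\n")[:n])

def contains_covid_kwds(text):
    return "COVID-19" in text and ("China" in text or "Chinese" in text)

# Hypothetical sample data
df = pd.DataFrame({"article": [
    "\nChina reports new COVID-19 cases.\nSecond sentence.\nThird.\nFourth.\nFifth.",
    "\nA story about the weather.\nNothing else.",
    "\nOne.\nTwo.\nThree.\nFour.\nCOVID-19 is mentioned in China only here.",
]})

mask = df["article"].apply(lambda s: contains_covid_kwds(first_sentences(s)))
filtered_df = df.loc[mask]  # keeps the full text of the matching articles
```

Only the first row survives here: the second has no keywords, and the third mentions them only in its fifth sentence.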
jfaccioni
  • Thanks for the answer. Could you please elaborate on what `s[:s.replace('\n', '#', 4)` means? – Yue Peng May 21 '20 at 19:49
  • `s.replace('\n', '#', 4)` returns the same string but with the first 4 occurrences of `'\n'` replaced by `'#'`. The replacement symbol is not relevant: we simply do this because we then use `.find('\n')` on the returned string to find the index where the next `'\n'` is. Since we just replaced the first 4 `'\n'`s, this gives us the location of the fifth `'\n'`, which is where your fourth sentence ends. Then we simply take this index and slice the original string with `s[:index_of_the_fifth_newline_char]`. – jfaccioni May 21 '20 at 20:22
  • It's a bit convoluted because we need to do these operations for each element in your column, but *before* actually passing them to the `contains_covid_kwds` function. This would not be necessary if you performed this "take-the-first-four-sentences-of-the-string" filter beforehand. – jfaccioni May 21 '20 at 20:23
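
The replace-then-find steps described in these comments can be traced on a short string. This is only an illustration; the string `s` and the variable names are made up:

```python
s = "\nA\nB\nC\nD\nE"  # leading '\n', then five one-letter "sentences"

masked = s.replace("\n", "#", 4)  # the first four '\n' become '#'
idx = masked.find("\n")           # position of the fifth '\n' in the original
first_four = s[:idx]              # slice the ORIGINAL string, not the masked one

print(repr(masked))      # '#A#B#C#D\nE'
print(idx)               # 8
print(repr(first_four))  # '\nA\nB\nC\nD'
```

Because the replacement character has the same length as `'\n'`, the index found in `masked` is also valid in `s`.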
1

You can use the pandas apply method with a custom function, as follows.

import pandas as pd

string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})

def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag=0
    keywords=['china','covid-19','wuhan']

    # Checking if the article has more than 4 sentences
    if len(string_list)>4:
        # iterating over string_list variable, which contains sentences.
        for i in range(4):
            # iterating over keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag=1
                    break
    # Else block is executed when article has less than or equal to 4 sentences
    else:
        # Iterating over string_list variable, which contains sentences
        for i in range(len(string_list)):
            # iterating over keywords list
            for key in keywords:
                # Checking if sentence contains any keyword
                if key in string_list[i]:
                    flag=1
                    break
    if flag==0:
        return False
    else:
        return True

Then call the pandas apply method on the article column:

df['Contains Keywords?'] = df['article'].apply(findKeys)
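
The function above returns True as soon as any one of the listed keywords appears. For the condition stated in the question, "COVID-19" AND ("China" OR "Chinese"), one possible adaptation is sketched below. The function name `has_required_keywords` and the sample rows are hypothetical; the lower-casing follows this answer's approach:

```python
import pandas as pd

def has_required_keywords(text, n_sentences=4):
    # Lower-case the text and keep only the first n sentences.
    # Slicing a list past its end is safe, so short articles need no special case.
    head = " ".join(text.strip().lower().split("\n")[:n_sentences])
    return "covid-19" in head and ("china" in head or "chinese" in head)

# Hypothetical sample data
df = pd.DataFrame({"article": [
    "\nChinese officials discussed COVID-19 today.\nMore text.",
    "\nCOVID-19 cases rise in Europe.\nMore text.",
]})

df["Contains Keywords?"] = df["article"].apply(has_required_keywords)
```

To drop the non-matching rows while keeping the full article text, the boolean column can be used as a mask, e.g. `df[df["Contains Keywords?"]]`.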
Kunal
  • But your code doesn't implement this condition: "COVID-19" AND ("China" OR "Chinese") – Yue Peng May 21 '20 at 15:15
  • You need to type the keywords in all lowercase letters. I specifically made everything lower case so that a keyword is not missed due to case difference – Kunal May 21 '20 at 15:24
  • That's smart. But what I mean is that a string containing EITHER China OR Chinese should be kept. The difference between these two words is not lower vs. upper case. – Yue Peng May 21 '20 at 15:32
0

Here:

found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in df['article']:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)
Community
Red
  • The condition `(s2 or s3)` will always be True for non-empty strings, whatever those strings contain, so it doesn't actually check anything. It should be (`s2 in string or s3 in string`). Also, the string to check is in a dataframe: "*I have a pandas dataframe called df. It has a column called article*". – Gino Mempin May 21 '20 at 00:20
  • Sorry, I made a typo. – Red May 21 '20 at 00:24
  • It's still not correct. `in` has a higher precedence than `or`, so it's equivalent to `s2 or (s3 in string)`, and `s2` again will always be True whatever it contains, so the check `s3 in string` becomes useless. [Check it for yourself](https://i.stack.imgur.com/eURq9.png). – Gino Mempin May 21 '20 at 00:26
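
The precedence issue raised in these comments can be checked with a short snippet. The text in `string` is made up and deliberately contains neither keyword:

```python
string = "nothing relevant here"
s2, s3 = "good", "great"

# Parsed as: s2 or (s3 in string). A non-empty s2 is truthy, so the
# membership test on s3 is never even evaluated.
wrong = bool(s2 or s3 in string)

# Each keyword is tested against the text explicitly.
right = s2 in string or s3 in string

print(wrong)  # True
print(right)  # False
```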
0

First I create a series which contains just the first four sentences from the original `df['article']` column, and convert it to lower case, assuming that searches should be case-insensitive.

articles = df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()

Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.

df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
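
Note that the example article in the question starts with `'\n'`, so splitting it as-is yields an empty first element, and `[:4]` then covers only three real sentences. A sketch of a variant that strips the text first and uses only the vectorised `.str` accessor; the sample rows are made up, and matching on the full keyword "covid-19" rather than "covid" is an assumption based on the question's wording:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"article": [
    "\nChina may be past the worst of the COVID-19 pandemic.\nSecond.\nThird.\nFourth.\nFifth.",
    "\nFirst.\nSecond.\nThird.\nFourth.\nCOVID-19 and China appear only in sentence five.",
]})

# First four sentences, lower-cased; str.strip() removes the leading '\n'
head = (
    df["article"].str.strip()
    .str.split("\n").str[:4].str.join("\n")
    .str.lower()
)

# regex=False treats "covid-19" as a literal; "china|chinese" is a regex alternation
mask = head.str.contains("covid-19", regex=False) & head.str.contains("china|chinese")
filtered = df[mask]  # full articles are kept, only the rows are filtered
```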
David Buck