Filtering country to apply different stopwords

Question

I have the following dataset

                                   Text
country     file                          
US          file_US                The Dish: Lidia Bastianich shares Italian recipes ... - CBS News
            file_US                Blog - Tasty Yummies
            file_US                Acne Alternative Remedies: Manuka Honey, Tea Tree Oil ...
            file_US                Looking back at 10 years of Downtown Arts | Times Leader 

IT          filename_IT            Tornando indietro a ...
            filename_IT            Questo locale è molto consigliato per le famiglie
                                                                            ...                                 
            filename_IT            Ci si chiede dove poter andare a mangiare una pizza  Melanzana Capriccia ...
            filename_IT            Ideale per chi ama mangiare vegano

with country and file as index levels. I want to apply a function that removes stopwords based on the value of the country index:

def removing(sent):
    
    if df.loc['US','UK']:
        stop_words = stopwords.words('english')
    if df.loc['ES']:
        stop_words = stopwords.words('spanish')    
    
# (and so on)
                      
    c_text = []

    for i in sent.lower().split():
        if i not in stop_words:
            c_text.append(i)

    return(' '.join(c_text))

df['New_Column'] = df['Text'].astype(str)
df['New_Column'] = df['New_Column'].apply(removing)

Unfortunately I am getting this error:

----> 6     if df.loc['US']:
      7         stop_words = stopwords.words('english')
      8     if df.loc['ES']:

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1477     def __nonzero__(self):
   1478         raise ValueError(
-> 1479             f"The truth value of a {type(self).__name__} is ambiguous. "
   1480             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1481         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

and I still do not understand how to fix it. Can you please tell me how I can run the code without getting this error?

Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – AMC, Jun 30 '20 at 00:34
Some people just downvote my answer without leaving a single word, so I will remove it. I hope you get the idea: do not use a for loop when you have pandas and NumPy – BENY, Jun 30 '20 at 00:43
@still_learning I know, no problem. I hope you have already looked at the np.where method – BENY, Jun 30 '20 at 00:47
@AMC, I provided an example of the sentences that I need to clean. For each country I have 10 similar sentences. Country is an index. I have already seen that question, but unfortunately I still do not fully understand how to change my code accordingly – still_learning, Jun 30 '20 at 00:47
@YOBEN_S, unfortunately I do not see how to fix the issue from the answer suggested by AMC – still_learning, Jun 30 '20 at 00:49
@still_learning 1st, that is not his answer; 2nd, your problem is different from what he linked – BENY, Jun 30 '20 at 00:50
@YOBEN_S _2nd, your problem is different from what he linked_ Can you elaborate? I thought they were quite similar. – AMC, Jun 30 '20 at 01:09
Unfortunately I have not fixed my issue yet, even after reading the question mentioned by AMC. If you have any advice or a solution, I would greatly appreciate it – still_learning, Jun 30 '20 at 01:19
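
For reference, here is a minimal sketch of why the `if df.loc[...]` check in the question fails. The two-row dataframe and the names `countries` and `is_english` are illustrative assumptions, not the asker's real data. `df.loc['US']` returns a whole DataFrame, which has no single truth value, whereas the country level of the index gives one label per row that can be tested directly.

```python
import pandas as pd

# Toy dataframe with the same index layout as in the question (assumed data).
df = pd.DataFrame(
    {"Text": ["Blog - Tasty Yummies", "Ideale per chi ama mangiare vegano"]},
    index=pd.MultiIndex.from_tuples(
        [("US", "file_US"), ("IT", "filename_IT")], names=["country", "file"]
    ),
)

# df.loc["US"] is a DataFrame, so using it as an `if` condition raises ValueError.
try:
    if df.loc["US"]:
        pass
    raised = False
except ValueError as err:
    raised = True
    message = str(err)

# One label per row: read the country level of the index and test that instead.
countries = df.index.get_level_values("country")
is_english = countries.isin(["US", "UK"])  # boolean array, one entry per row
```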

Mehul Gupta · Accepted Answer · 2020-07-05T04:41:03.743

2

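# Assuming all the required libraries are imported, the dataframe is named df,
# and its index holds only the country code.
# Map each country code to the language of its stopword list.
lang = {'UK': 'english', 'US': 'english', 'ES': 'spanish'}
# Load each stopword list once, as a set for fast membership tests.
stop_sets = {code: set(stopwords.words(name)) for code, name in lang.items()}
# Build the cleaned column row by row, so repeated country labels are handled correctly.
df['Text'] = [' '.join(word for word in str(text).split(' ') if word not in stop_sets[country])
              for country, text in zip(df.index, df['Text'])]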

Updated answer:

import pandas as pd
from nltk.corpus import stopwords

# Two-level index: the first level plays the role of the country code
ind = pd.MultiIndex.from_arrays([['ind', 'ind', 'ind', 'ind', 'aus', 'aus', 'aus', 'aus'],
                                 ['1', '2', '3', '4', '5', '6', '7', '8']])
df = pd.DataFrame(['he is boy'] * 8, index=ind, columns=['text'])
lang = {'ind': 'spanish', 'aus': 'english'}
for index, row in df.iterrows():
    # index is a tuple such as ('ind', '1'); index[0] selects the language
    stop_words = stopwords.words(lang[index[0]])
    df.at[index, 'text'] = ' '.join([word for word in str(row['text']).split(' ') if word not in stop_words])

Before running the loop: [screenshot of df, not preserved]

After running the loop: [screenshot of df, not preserved]

Use the example above as a reference for your own data.
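
As a possible variant of the same idea, the sketch below reads the country level with `get_level_values` and builds the new column in one pass, which also works when index labels repeat. To stay runnable without the NLTK corpus, it uses tiny hand-written stopword sets as stand-ins for `stopwords.words(...)`; `STOP_WORDS`, `LANG` and `remove_stopwords` are hypothetical names, and the two sample rows are taken from the question.

```python
import pandas as pd

# Hypothetical stand-ins for stopwords.words('english') and stopwords.words('italian').
STOP_WORDS = {
    "english": {"the", "at", "of", "a"},
    "italian": {"per", "chi", "di", "a"},
}
# Country code -> stopword language (assumed mapping).
LANG = {"US": "english", "UK": "english", "IT": "italian"}

df = pd.DataFrame(
    {"Text": ["Looking back at 10 years of Downtown Arts",
              "Ideale per chi ama mangiare vegano"]},
    index=pd.MultiIndex.from_tuples(
        [("US", "file_US"), ("IT", "filename_IT")], names=["country", "file"]
    ),
)


def remove_stopwords(text, country):
    """Drop the words that are stopwords in the language of `country`."""
    stop = STOP_WORDS[LANG[country]]
    return " ".join(w for w in str(text).split() if w.lower() not in stop)


# Pair every text with the country label of its row, ignoring the file level.
countries = df.index.get_level_values("country")
df["New_Column"] = [remove_stopwords(t, c) for t, c in zip(df["Text"], countries)]
```

With real data, the sets in `STOP_WORDS` would presumably be replaced by `set(stopwords.words(language))` for each language.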

edited Jul 05 '20 at 04:41

answered Jun 30 '20 at 04:41


Thank you Mehul Gupta. I have a problem applying your code because there is another index level in my dataset. How can I select only the country? I am receiving this message: `KeyError: ('UK', 'original+key')` – still_learning Jun 30 '20 at 12:25
Can you tell me how many index levels there are in the dataset? Attaching a screenshot would help – Mehul Gupta Jun 30 '20 at 14:10
I have tried to attach the screenshot, but I am having trouble saving the changes. There are two indices: `country` and `name`. (I will keep trying to update the question with the screenshot... I hope it may be fixed soon) – still_learning Jun 30 '20 at 18:34
So ['country','name'] is your index, right? If so, keep everything as it is and change index to index[0] in stopwords.words(). If this works, do let me know and I will update my answer as well – Mehul Gupta Jul 01 '20 at 04:05
Hi @Mehul Gupta, sorry for my late reply. After changing the stopwords call as you suggested, `stopwords.words(lang[index[0]])`, I got the following error: `TypeError: 'int' object is not subscriptable` – still_learning Jul 03 '20 at 18:09
Can you add a screenshot of your data, using df.head()? – Mehul Gupta Jul 04 '20 at 04:01
I updated the dataset to show what my data looks like. Please let me know if it is better and/or if you have any further questions. Thanks a lot for your help – still_learning Jul 04 '20 at 18:14
I updated my answer and used an example similar to your data. Let me know if this works! – Mehul Gupta Jul 05 '20 at 04:41

Red · Answer 2 · 2020-07-04T21:25:21.397

2

Here is how you can use numpy.where():

import re

import pandas as pd
from numpy import where
from nltk.corpus import stopwords

# 'country' must be a regular column here (use df.reset_index() if it is an index level)
df = pd.DataFrame(...)

# Remove the English stopwords from the English sentences
for w in stopwords.words('english'):
    pattern = rf'\b{re.escape(w)}\b'  # whole words only, so "a" does not match inside "pizza"
    df['Text'] = where(df['country'].isin(['US', 'UK']),  # If the country is English-speaking
                       df['Text'].str.replace(pattern, '', regex=True),  # Replace the stopword with a blank
                       df['Text'])  # Otherwise leave the sentence unchanged

# Remove the Spanish stopwords from the Spanish sentences
for w in stopwords.words('spanish'):
    pattern = rf'\b{re.escape(w)}\b'
    df['Text'] = where(df['country'] == 'ES',  # If the country is Spanish-speaking
                       df['Text'].str.replace(pattern, '', regex=True),  # Replace the stopword with a blank
                       df['Text'])  # Otherwise leave the sentence unchanged
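
Since the comments below ask for a worked example, here is a small sketch of the `where` approach on a dummy dataframe, under a few assumptions: the index is first turned into a column with `reset_index()`, short hand-written lists stand in for the NLTK stopword lists, and `strip_words` is a hypothetical helper. It matches whole words only, because a plain substring replace would also remove letters inside longer words.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-ins for stopwords.words('english') and stopwords.words('spanish').
ENGLISH = ["the", "a", "of"]
SPANISH = ["el", "la", "de"]

# Dummy data (assumed); reset_index() turns the country index into a column.
df = pd.DataFrame(
    {"Text": ["the taste of a pizza", "el sabor de la pizza"]},
    index=pd.Index(["US", "ES"], name="country"),
).reset_index()


def strip_words(series, words):
    """Remove whole-word matches of `words`, then tidy the leftover spaces."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    cleaned = series.str.replace(pattern, "", regex=True)
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()


# Keep the cleaned text only where the country matches; other rows are untouched.
df["Text"] = np.where(df["country"].isin(["US", "UK"]),
                      strip_words(df["Text"], ENGLISH), df["Text"])
df["Text"] = np.where(df["country"] == "ES",
                      strip_words(df["Text"], SPANISH), df["Text"])
```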

edited Jul 04 '20 at 21:25

answered Jul 04 '20 at 18:55


@still_learning can you show me the dataframe's base? – Red Jul 04 '20 at 21:24
@still_learning You can also try changing `df['country']` to `df.index`. – Red Jul 04 '20 at 21:27
I updated the dataset within the question. It looks like what is shown there :) I just edited the Text column to avoid confusion. When you say "change `df['country']` to `df.index`", how can I select the country index? – still_learning Jul 04 '20 at 21:35
@still_learning I mean the base of the dataframe, like `pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})` – Red Jul 04 '20 at 21:41
can you provide a screenshot of it? – Red Jul 04 '20 at 21:45
Try changing all the `df['country']` to `df[['country']]`. – Red Jul 04 '20 at 21:56
If that doesn't work, change `df['country']` to `df['path']`. – Red Jul 04 '20 at 21:58
When I change `df['country']` to `df[['country']]` (or to path) I get this error: `KeyError: "None of [Index(['country'], dtype='object')] are in the [columns]"` – still_learning Jul 04 '20 at 22:02
Is the dataset in the very top generated by pandas, or is that a copy paste from the csv? – Red Jul 04 '20 at 22:03
Is the dataset in the very top generated by pandas? – Red Jul 04 '20 at 22:05
I think there is a problem with the loop as it replicates the same text through all the rows – still_learning Jul 04 '20 at 23:21
I tested it on a dummy dataframe, and it worked fine. – Red Jul 04 '20 at 23:22
Maybe I am doing something wrong, but I have just copied the code you suggested, after transforming the index to a column in the original dataset. Could you please provide an example of that? – still_learning Jul 04 '20 at 23:30

score -1 · Answer 3 · answered Jun 30 '20 at 00:19

-1

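Define your function as

def removing(x):
    # x is one row of df; "Country" and "text" must be regular columns
    thecountry = x["Country"]
    if thecountry == "UK" or thecountry == "US":
        stop_words = stopwords.words("english")
    # ... (etc. for the other countries)
    return " ".join(w for w in str(x["text"]).split() if w not in stop_words)

And then df["filtered"] = df.apply(removing, axis=1)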
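
A fuller sketch of this row-wise approach might look like the following, assuming the column names from the question (`country`, `file`, `Text`) rather than the ones above. The small stopword sets are hand-written stand-ins for `stopwords.words(...)`, and the Spanish row with its `file_ES` label is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the NLTK stopword lists.
STOP_WORDS = {
    "english": {"the", "a", "of"},
    "spanish": {"una", "para", "la"},
}

df = pd.DataFrame(
    {"Text": ["The Dish shares Italian recipes", "Una pizza para la familia"]},
    index=pd.MultiIndex.from_tuples(
        [("US", "file_US"), ("ES", "file_ES")], names=["country", "file"]
    ),
)


def removing(row):
    """Pick the stopword set from the row's country, then filter its text."""
    if row["country"] in ("US", "UK"):
        stop_words = STOP_WORDS["english"]
    elif row["country"] == "ES":
        stop_words = STOP_WORDS["spanish"]
    else:
        stop_words = set()  # unknown country: remove nothing
    return " ".join(w for w in str(row["Text"]).lower().split() if w not in stop_words)


# apply(axis=1) passes columns only, so move the index levels into columns first.
flat = df.reset_index()
flat["filtered"] = flat.apply(removing, axis=1)
# Restore the original two-level index afterwards.
df = flat.set_index(["country", "file"])
```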

answered Jun 30 '20 at 00:19

Igor Rivin


No, but just `reset_index()` first, you can `set_index("Country")` later. – Igor Rivin Jun 30 '20 at 00:27
