0

I have the following dataset

                                   Text
country     file                          
US          file_US                The Dish: Lidia Bastianich shares Italian recipes ... - CBS News
            file_US                Blog - Tasty Yummies
            file_US                Acne Alternative Remedies: Manuka Honey, Tea Tree Oil ...
            file_US                Looking back at 10 years of Downtown Arts | Times Leader 

IT          filename_IT            Tornando indietro a ...
            filename_IT            Questo locale è molto consigliato per le famiglie
                                                                            ...                                 
            filename_IT            Ci si chiede dove poter andare a mangiare una pizza  Melanzana Capriccia ...
            filename_IT            Ideale per chi ama mangiare vegano
              

with country and file indices. I want to apply a function that removes stopwords based on the value of the country index:

def removing(sent):
    
    if df.loc['US','UK']:
        stop_words = stopwords.words('english')
    if df.loc['ES']:
        stop_words = stopwords.words('spanish')    
    
# (and so on)
                      
    c_text = []

    for i in sent.lower().split():
        if i not in stop_words:
            c_text.append(i)

    return(' '.join(c_text))

df['New_Column'] = df['Text'].astype(str)
df['New_Column'] = df['New_Column'].apply(removing)

Unfortunately I am getting this error:

----> 6     if df.loc['US']:
      7         stop_words = stopwords.words('english')
      8     if df.loc['ES']:

/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1477     def __nonzero__(self):
   1478         raise ValueError(
-> 1479             f"The truth value of a {type(self).__name__} is ambiguous. "
   1480             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1481         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

and I still don't understand how to fix it. Can you please tell me how I can run the code without getting this error?

still_learning
  • Please provide a [mcve]. – AMC Jun 30 '20 at 00:33
  • Does this answer your question? [Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) – AMC Jun 30 '20 at 00:34
  • Some people just downvote my answer without leaving a single word, so I will remove it. I hope you get the idea: don't use a for loop when you have pandas and numpy – BENY Jun 30 '20 at 00:43
  • It was not me, @YOBEN_S. I have just opened Stack Overflow – still_learning Jun 30 '20 at 00:46
  • @still_learning I know, no problem. I hope you have already picked up the np.where method – BENY Jun 30 '20 at 00:47
  • @AMC, I provided an example of the sentences that I need to clean. For each country I have 10 similar sentences. Country is an index. I have already seen that question, but unfortunately I still have not completely understood how to change my code accordingly – still_learning Jun 30 '20 at 00:47
  • @YOBEN_S, unfortunately I don't see how to fix the issue from the answer suggested by AMC – still_learning Jun 30 '20 at 00:49
  • @still_learning 1st, that is not his answer; 2nd, your problem is different from what he linked – BENY Jun 30 '20 at 00:50
  • @YOBEN_S _2nd your problem is different from what he linked_ Can you elaborate? I thought they were quite similar. – AMC Jun 30 '20 at 01:09
  • Unfortunately I have not fixed my issue yet, even after reading the question mentioned by AMC. If you have any advice or a solution, I would greatly appreciate it – still_learning Jun 30 '20 at 01:19

3 Answers

2
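# Assumes all the required libraries are imported (pandas, nltk.corpus.stopwords)
# Map each country code to its stopword language
lang = {'UK': 'english', 'US': 'english', 'ES': 'spanish'}
# Assumes your DataFrame is df, indexed by country code only
# (with a MultiIndex, index is a tuple, so use lang[index[0]] instead).
# Note: df.loc[index, ...] writes to every row sharing that label,
# so this also assumes the index labels are unique.
for index, row in df.iterrows():
    df.loc[index, 'Text'] = ' '.join([word for word in str(row['Text']).split(' ') if word not in stopwords.words(lang[index])])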

Updated answer:

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

# Example DataFrame with a two-level index (country code, row id)
ind = pd.MultiIndex.from_arrays([['ind', 'ind', 'ind', 'ind', 'aus', 'aus', 'aus', 'aus'],
                                 ['1', '2', '3', '4', '5', '6', '7', '8']])
df = pd.DataFrame(['he is boy'] * 8, index=ind, columns=['text'])
lang = {'ind': 'spanish', 'aus': 'english'}
for index, row in df.iterrows():
    # index is a (country, id) tuple, so index[0] is the country code
    df.at[(index[0], index[1]), 'text'] = ' '.join([word for word in str(row['text']).split(' ') if word not in stopwords.words(lang[index[0]])])

[Two screenshots of the DataFrame, one before and one after running the loop, are not available.]

Try adapting the example above to your own data.
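The per-row loop above can also be written without `iterrows`, by reading the country level of the index directly. This is only a minimal sketch under stated assumptions: the index levels are named `country` and `file` as in the question, and the small `STOPWORDS` dictionary is a hypothetical stand-in for `stopwords.words(...)` so that the example runs without downloading the NLTK corpus. The `LANG` mapping and the helper name `remove_stopwords` are also illustrative.

```python
import pandas as pd

# Hypothetical, tiny stopword sets standing in for nltk's stopwords.words(...)
STOPWORDS = {
    "english": {"the", "at", "of", "a"},
    "italian": {"per", "le", "chi", "di"},
}
# Illustrative mapping from country code to stopword language
LANG = {"US": "english", "UK": "english", "IT": "italian"}

index = pd.MultiIndex.from_tuples(
    [("US", "file_US"), ("US", "file_US"), ("IT", "filename_IT")],
    names=["country", "file"],
)
df = pd.DataFrame(
    {"Text": ["Looking back at 10 years of Downtown Arts",
              "Blog - Tasty Yummies",
              "Ideale per chi ama mangiare vegano"]},
    index=index,
)

def remove_stopwords(text, country):
    """Drop the stopwords of the language associated with `country`."""
    stop_words = STOPWORDS[LANG[country]]
    return " ".join(w for w in str(text).split() if w.lower() not in stop_words)

# Pair each text with its country label; this works even when index
# labels repeat, because nothing is written back through .loc/.at.
countries = df.index.get_level_values("country")
df["New_Column"] = [remove_stopwords(t, c) for t, c in zip(df["Text"], countries)]
print(df["New_Column"].tolist())
```

Because the result is assigned as a whole column, repeated `(country, file)` labels such as the ones in the question do not cause rows to overwrite each other.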

Mehul Gupta
  • Thank you Mehul Gupta. I am having trouble applying your code because there is another index level in my dataset. How can I select only the country? I am getting this error: `KeyError: ('UK', 'original+key')` – still_learning Jun 30 '20 at 12:25
  • Can you tell me how many levels of the index there are in the dataset? Attaching a screenshot would be helpful – Mehul Gupta Jun 30 '20 at 14:10
  • I have tried to attach the screenshot, but I am having trouble saving the changes. There are two indices: `country` and `name`. (I will keep trying to update the question with the screenshot... I hope it may be fixed soon) – still_learning Jun 30 '20 at 18:34
  • So ['country','name'] is your index, right? If so, keep everything as it is and change index to index[0] in stopwords.words(). If this works, let me know and I will update my answer as well – Mehul Gupta Jul 01 '20 at 04:05
  • Hi @Mehul Gupta, sorry for my late reply. After changing the stopwords call as you suggested, `stopwords.words(lang[index[0]])])`, I got the following error: `TypeError: 'int' object is not subscriptable` – still_learning Jul 03 '20 at 18:09
  • Can you add a screenshot of your data, using df.head()? – Mehul Gupta Jul 04 '20 at 04:01
  • I updated the dataset to show what my data looks like. Please let me know if it is better and/or if you have any further questions. Thanks a lot for your help – still_learning Jul 04 '20 at 18:14
  • I updated my answer and tried an example similar to your data. Let me know if this works! – Mehul Gupta Jul 05 '20 at 04:41
2

Here is how you can use numpy.where():

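import re
import pandas as pd
from numpy import where
from nltk.corpus import stopwords

df = pd.DataFrame(...)

# Assumes 'country' is a regular column; if it is an index level,
# run df = df.reset_index() first.

# Remove the English stopwords from the English sentences
for w in stopwords.words('english'):
    pattern = r'\b' + re.escape(w) + r'\b'  # match whole words only, not substrings
    df['Text'] = where(df['country'].isin(['US', 'UK']),  # If the country is English-speaking
                       df['Text'].str.replace(pattern, '', regex=True),  # Replace the stopword with blank
                       df['Text'])  # Otherwise keep the text unchanged

# Remove the Spanish stopwords from the Spanish sentences
for w in stopwords.words('spanish'):
    pattern = r'\b' + re.escape(w) + r'\b'  # match whole words only, not substrings
    df['Text'] = where(df['country'] == 'ES',  # If the country is Spanish-speaking
                       df['Text'].str.replace(pattern, '', regex=True),  # Replace the stopword with blank
                       df['Text'])  # Otherwise keep the text unchanged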
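A replace-based approach like this is sensitive to two details: whether `country` is a column or an index level, and whether the pattern matches whole words or arbitrary substrings. The sketch below illustrates both under stated assumptions: the index levels are named `country` and `file` as in the question, and `ENGLISH_STOPWORDS` is a hypothetical three-word stand-in for `stopwords.words('english')` so the example runs without the NLTK corpus.

```python
import re
import numpy as np
import pandas as pd

ENGLISH_STOPWORDS = ["a", "is", "the"]  # hypothetical stand-in for the NLTK list

df = pd.DataFrame(
    {"Text": ["a cat is at the gate", "voy a la playa"]},
    index=pd.MultiIndex.from_tuples(
        [("US", "f1"), ("ES", "f2")], names=["country", "file"]
    ),
)

# Build the mask from the index level instead of a 'country' column
is_english = df.index.get_level_values("country").isin(["US", "UK"])

# Plain substring replacement also deletes the letter inside other words
naive = df["Text"].str.replace("a", "", regex=False)

# One whole-word pattern for all stopwords, then tidy the leftover spaces
pattern = r"\b(?:" + "|".join(map(re.escape, ENGLISH_STOPWORDS)) + r")\b"
cleaned = (
    df["Text"]
    .str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Keep the cleaned text only where the row is English
df["Text"] = np.where(is_english, cleaned, df["Text"])
print(df["Text"].tolist())
```

Combining the stopwords into a single alternation means the column is scanned once per language rather than once per stopword, which is a design choice of this sketch rather than something the answer prescribes.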
Red
  • @still_learning can you show me the dataframe's base? – Red Jul 04 '20 at 21:24
  • @still_learning You can also try changing `df['country']` to `df.index`. – Red Jul 04 '20 at 21:27
  • I updated the dataset within the question. It looks like what is shown there :) I just edited the Text column to avoid confusion. When you say "change `df['country']` to `df.index`", how can I select the country index? – still_learning Jul 04 '20 at 21:35
  • @still_learning I mean the base of the dataframe, like `pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})` – Red Jul 04 '20 at 21:41
  • can you provide a screenshot of it? – Red Jul 04 '20 at 21:45
  • Try changing all the `df['country']` to `df[['country']]`. – Red Jul 04 '20 at 21:56
  • If that doesn't work, change `df['country']` to `df['path']`. – Red Jul 04 '20 at 21:58
  • When I change `df['country']` to `df[['country']]` (or to path) I get this error: `KeyError: "None of [Index(['country'], dtype='object')] are in the [columns]"` – still_learning Jul 04 '20 at 22:02
  • Is the dataset in the very top generated by pandas, or is that a copy paste from the csv? – Red Jul 04 '20 at 22:03
  • Is the dataset in the very top generated by pandas? – Red Jul 04 '20 at 22:05
  • I think there is a problem with the loop as it replicates the same text through all the rows – still_learning Jul 04 '20 at 23:21
  • I tested it on a dummy dataframe, and it worked fine. – Red Jul 04 '20 at 23:22
  • Maybe I am doing something wrong, but I have just copied the code you suggested, after transforming the index to a column in the original dataset. Could you please provide an example of that? – still_learning Jul 04 '20 at 23:30
-1

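Define your function to take a whole row, so that it can read the country:

def removing(x):
    # Assumes country is a regular column (e.g. after df = df.reset_index())
    thecountry = x["country"]
    if thecountry == "UK" or thecountry == "US":
        stop_words = stopwords.words("english")
    # ... (etc. for the other countries)
    return " ".join(w for w in str(x["Text"]).split() if w not in stop_words)

And then df["filtered"] = df.apply(removing, axis=1)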
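The row-wise idea can be sketched as follows when the country is an index level rather than a column. The sketch first reproduces the cause of the error in the question: `df.loc['US']` returns a DataFrame, and a DataFrame has no single truth value. It then compares the row's own label instead. Assumptions: the index levels are named `country` and `file`, and `STOPWORDS` is a hypothetical stand-in for `stopwords.words(...)`; `LANG` is an illustrative mapping.

```python
import pandas as pd

# Hypothetical stand-ins for the NLTK stopword lists
STOPWORDS = {"english": {"the", "is", "a"}, "spanish": {"el", "la", "es"}}
LANG = {"US": "english", "UK": "english", "ES": "spanish"}

df = pd.DataFrame(
    {"Text": ["la casa es grande", "the dish is tasty"]},
    index=pd.MultiIndex.from_tuples(
        [("ES", "file_ES"), ("US", "file_US")], names=["country", "file"]
    ),
)

# df.loc["US"] selects rows and returns a DataFrame, so using it in an
# `if` raises "The truth value of a DataFrame is ambiguous".
try:
    bool(df.loc["US"])
    raised = False
except ValueError:
    raised = True

def removing(row):
    # With axis=1, row.name is the row's index label: a (country, file) tuple
    country = row.name[0]
    stop_words = STOPWORDS[LANG[country]]
    return " ".join(w for w in row["Text"].lower().split() if w not in stop_words)

df["filtered"] = df.apply(removing, axis=1)
print(df["filtered"].tolist())
```

Looking the language up in a dictionary replaces the chain of `if` statements, so a country that is missing from `LANG` fails with a `KeyError` instead of silently reusing another language's stopwords.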

Igor Rivin