
I'm trying to translate non-English text to English using TextBlob's translate function. My data set is a pandas DataFrame.

I know that it works outside of a pandas DataFrame. For example,

from textblob import TextBlob

what = TextBlob("El apartamento de Evan esta muy bien situado, con fácil acceso al cualquier punto de Manhattan gracias al metro.")
whatt = what.translate(to='en')
print(whatt)

But when I apply it to a pandas DataFrame column, TextBlob translate doesn't work properly.
I searched for a way to address this and found the code below, but it gives me a different error message. Could anyone help me with this?

data["comments"] = data["comments"].str.encode('ISO 8859-1', 'ignore').apply(lambda x: TextBlob(x.strip()).translate(to='en'))

TypeError: cannot use a string pattern on a bytes-like object
Todd

1 Answer


Interesting problem

import pandas as pd
from textblob import TextBlob

data = {'number': [1, 2], 'comments': ['El apartamento de Evan', 'Manhattan gracias al metro']}
df = pd.DataFrame(data)

and then let's put the translation into a new column, converted to a plain string

df["commentst"] = df["comments"].apply(lambda x: str(TextBlob(x).translate(to='en')))

and that gives

    number  comments                    commentst
0   1       El apartamento de Evan      Evan's Apartment
1   2       Manhattan gracias al metro  Manhattan thanks to the subway
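
As for the TypeError in the question: it most likely comes from the `.str.encode(...)` step, which turns every comment into a `bytes` object before TextBlob sees it. The sketch below does not use TextBlob at all; it only reproduces the same kind of failure with the standard `re` module on a single string, on the assumption that something similar happens downstream in the original call.

```python
import re

comment = "El apartamento de Evan"

# Encoding turns the str into bytes; Series.str.encode does this per element
encoded = comment.encode("ISO-8859-1", "ignore")
print(type(encoded))

# Applying a str regex pattern to bytes raises a TypeError
try:
    re.search(r"\w+", encoded)
    message = None
except TypeError as err:
    message = str(err)
print(message)

# Decoding gives back a str that string-based tools can handle
decoded = encoded.decode("ISO-8859-1")
```

So dropping the encode step and passing the column's strings directly, as in the `apply` above, avoids the problem.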

And here is a minimal attempt that leaves text already in English as it is

def get_english(message):
    analysis = TextBlob(message)
    language = analysis.detect_language()
    if language == 'en':
        return message
    return str(analysis.translate(to='en'))

df["commentst"] = df["comments"].apply(get_english)
df

It gives the same result as before on my sample, but I am not sure how it will behave on your data.
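
If language detection turns out to be unreliable on mixed-language rows, a more defensive variant could fall back to the original text whenever the translation call raises. This is only a sketch: `safe_translate` and `translate_fn` are hypothetical names, where `translate_fn` stands in for whatever does the translation (for example a wrapper around `TextBlob(...).translate(to='en')`), and `fake_translate` exists purely so the example runs offline.

```python
def safe_translate(message, translate_fn):
    """Return the translation, or the original text if translation fails."""
    if not isinstance(message, str) or not message.strip():
        return message  # leave missing or empty cells untouched
    try:
        return translate_fn(message)
    except Exception:
        # Broad on purpose for this sketch; in real code, catch the specific
        # exception your translator raises when the input comes back unchanged
        return message


def fake_translate(text):
    # Hypothetical offline translator used only for demonstration
    lookup = {"Manhattan gracias al metro": "Manhattan thanks to the subway"}
    if text not in lookup:
        raise ValueError("input returned unchanged")
    return lookup[text]


comments = ["Manhattan gracias al metro", "Great location", None]
translated = [safe_translate(c, fake_translate) for c in comments]
print(translated)
```

With a DataFrame this could be applied as `df["comments"].apply(safe_translate, translate_fn=my_translate)`, where `my_translate` is your own function.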

Paul Brennan
  • Hi, thanks for your answer! I have an additional question. If some comments mix English with other languages, I noticed that it also gives an error (NotTranslated: Translation API returned the input string unchanged). I tried wrapping the call in try/except, but it didn't seem to have any effect. Do you have any idea how I can address this issue? – Todd Jan 14 '21 at 21:54
  • Do you know which languages you are translating from? It helps a lot with the translation. – Paul Brennan Jan 14 '21 at 23:30
  • I checked, and it's mostly Spanish, but it seems some other languages are in there too (customer reviews). I can't check every row since there are too many of them. – Todd Jan 15 '21 at 02:22
  • Thank you for the update. I ran it on a sub-sample of my data and it works fine. It leaves English text as is. But I have around 27000 reviews and ran into an error: HTTPError: Too Many Requests. This doesn't seem to be an error in your code but some kind of restriction or limit on how many requests can be sent when Python uses the Google Translation API? – Todd Jan 15 '21 at 05:21
  • You are implicitly calling Google Translate, and they may have limits. I don't know the details of those. – Paul Brennan Jan 15 '21 at 11:49
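
For the `Too Many Requests` error discussed in the comments, two things that often help with rate-limited services are translating each distinct comment only once and pausing between retries. The sketch below assumes exactly that and nothing about the real limits of the service. `translate_unique`, `translate_fn`, the retry count and the delays are all illustrative choices, and `flaky_translate` is a hypothetical translator that exists only so the example runs offline.

```python
import time


def translate_unique(texts, translate_fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Translate each distinct text once, retrying with exponential backoff."""
    cache = {}
    for text in texts:
        if text in cache:
            continue  # already translated, no extra request needed
        for attempt in range(retries):
            try:
                cache[text] = translate_fn(text)
                break
            except Exception:
                # Wait base_delay, then 2x, 4x ... before the next attempt
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)
        else:
            cache[text] = text  # all attempts failed: keep the original
    return [cache[t] for t in texts]


calls = {"n": 0}


def flaky_translate(text):
    # Hypothetical translator: fails on the first call, then succeeds
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("Too Many Requests")
    return text.upper()


waits = []  # record the requested pauses instead of actually sleeping
result = translate_unique(["hola", "hola", "metro"], flaky_translate, sleep=waits.append)
print(result, waits, calls["n"])
```

On a DataFrame this could be used as `df["commentst"] = translate_unique(df["comments"].tolist(), my_translate)`, with `my_translate` being your own translation function. Reviews that repeat verbatim then cost only one request each.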