
The tweets I capture when streaming with Tweepy contain Unicode escape sequences (for example "\u00e7") instead of the actual characters, and I need them as readable letters. I have found many solutions on this site, but none of them worked or seemed to apply to my case, since I'm collecting tweets in real time. Can anyone help?

Here’s my code:

from urllib3.exceptions import ProtocolError
from tweepy import Stream
from tweepy.auth import OAuthHandler
from tweepy.streaming import StreamListener
import time

ckey = 'your code here'
csecret = 'your code here'
atoken = 'your code here'
asecret = 'your code here'

class listener(StreamListener):
    
    def on_data(self, data):
        while True:
            try:
                #print (data)
                tweet = data.split(',"text":"')[1].split('","')[0]
                tweet2 = data.split(',"screen_name":"')[1].split('","location')[0]
                print (tweet2,tweet)
                saveFile = open ('test.csv','a')
                saveFile.write('@')
                saveFile.write(tweet2)
                saveFile.write(';')
                saveFile.write(tweet)
                saveFile.write('\n')
                saveFile.close()
                return True
        
            except ProtocolError:
                continue
            except BaseException as e:
                print ('Failed on data', str(e))
                break
    
    def on_error(self, status):
        print (status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=['keyword'])

Here's my output for the keyword "fluminense":

adrianabpadilha Impressionante como mesmo com poucas op\u00e7\u00f5es para o banco o Burro s\u00f3 me sobe o Wisney e o Higor! Pq n\u00e3o levar o Pato\u2026 https:\/\/t.co\/lO4CJJsaaP
Miguel_Aalmeida RT @pulligffc: O Fluminense em dia de jogo olha pra mim e faz isso
TRANQUILINHO3 Time fdpt \ud83d\ude20
LeleoCasttroo @jrmenini @FFvinho Palmeiras e Fluminense ainda tiveram a base como fonte de renda, atl\u00e9tico n\u00e3o revela um jogador\u2026 https:\/\/t.co\/ZF8awS6pDt
SouzaArthur6 @CezarSabia @andreisilvasoar @ndrzej87 @futebol_info C\u00e9zar, existe um tempo certo de testagem, q se d\u00e1 no 5\u00b0 da doe\u2026 https:\/\/t.co\/zmBlBzafdo
Thomasrodrigue_ @renatojr_07 \u00c9 o mesmo exemplo da final da ta\u00e7a rio, a \u00fanica coisa que muda \u00e9 que na final n\u00e3o tinha jogador contam\u2026 https:\/\/t.co\/3Q2nCBw9XS

As you can see, some characters like "ç" and "õ" are shown as "\u00e7" and "\u00f5" respectively.

Thank you!

yuko

1 Answer


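This happens because the text is still in its escaped form: non-ASCII characters appear as `\uXXXX` escape sequences rather than as the characters themselves. You can decode such a string with the unicode_escape codec.

For example (Python 2 syntax):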

s = r'\u00e7'
print s                            # output: \u00e7
print s.decode('unicode-escape')   # output: ç
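In Python 3, `str` has no `decode` method, so the example above does not carry over directly. A rough equivalent, assuming the string contains only ASCII characters and `\uXXXX` escapes, is to encode it to bytes first and then decode those bytes with the same codec. The helper name `decode_escapes` is made up for this sketch.

```python
def decode_escapes(s):
    """Turn literal \\uXXXX sequences in an ASCII-only str into real characters.

    Assumption: s is pure ASCII. encode('ascii') raises UnicodeEncodeError
    if s already contains non-ASCII characters.
    """
    return s.encode('ascii').decode('unicode_escape')


print(decode_escapes(r'op\u00e7\u00f5es'))  # opções
```

This approach has limits: characters outside the Basic Multilingual Plane, such as emoji written as a surrogate pair (`\ud83d\ude20`), come out as two separate surrogates instead of one character, which can then fail when printed or written to a file.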
Darkknight
  • Hi! I tried to apply this to my 'tweet' object like this: `print (tweet2,tweet.decode('unicode-escape'))` The output now shows this error: `"Failed on data 'str' object has no attribute 'decode'"` – yuko Feb 04 '21 at 00:09
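The error in the comment is what Python 3 reports, because `str` objects there are already decoded text and have no `decode` method. Since the `data` passed to `on_data` is a raw JSON document, one way to sidestep the problem is to parse it with the standard `json` module instead of splitting the string by hand: the parser resolves `\uXXXX` escapes, surrogate pairs for emoji, and `\/` on its own. The sketch below is not tied to Tweepy; `extract_tweet` and the sample payload are made up for illustration, and it assumes every message has `text` and `user.screen_name` fields, which may not hold for non-tweet messages on the stream.

```python
import json


def extract_tweet(data):
    """Parse one raw JSON payload and return (screen_name, text).

    Assumption: the payload is a tweet object with 'text' and
    'user' -> 'screen_name' keys; other message types raise KeyError.
    """
    status = json.loads(data)
    return status["user"]["screen_name"], status["text"]


# Hypothetical payload, shaped like the escaped output in the question.
sample = (
    r'{"text": "Time fdpt \ud83d\ude20 n\u00e3o https:\/\/t.co\/x", '
    r'"user": {"screen_name": "TRANQUILINHO3", "location": null}}'
)

name, text = extract_tweet(sample)
print(name, text)
```

Inside `on_data`, the two `split` lines could then be replaced by a single call such as `tweet2, tweet = extract_tweet(data)`. When saving, opening the file with an explicit encoding, for example `open('test.csv', 'a', encoding='utf-8')`, avoids depending on the platform default.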