(Edit: my original question is posted here, but the issue has been resolved and the code below is correct.) I am looking for advice on how to convert Unicode characters to Turkish characters. The following code (posted online) scrapes tweets for an individual user and outputs a CSV file, but the Turkish characters come out as escaped Unicode characters, e.g. \xc4. I am using Python 3 on a Mac.

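import sys

# Python 2 leftover: under Python 3 the default encoding is already 'utf-8',
# so this branch never runs (reload and sys.setdefaultencoding do not exist there).
default_encoding = 'utf-8'
if sys.getdefaultencoding() != default_encoding:
    reload(sys)
    sys.setdefaultencoding(default_encoding)

import tweepy  # https://github.com/tweepy/tweepy
import csv
import string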

#Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # print("getting tweets before %s" % oldest)

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text] for tweet in alltweets]

    # write the csv (str values are passed through unencoded; the file handles the encoding)
    with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(outtweets)

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets("")
bayrah
  • What happens if you *don't* encode `tweet.text`? – Mark Ransom Sep 12 '16 at 22:48
  • @MarkRansom if I enter just `tweet.text` instead of `tweet.text.encode("utf-8")`, I get the following error: "UnicodeEncodeError: 'ascii' codec can't encode character '\xd6' in position 55: ordinal not in range(128)" – bayrah Sep 12 '16 at 23:02
  • `setdefaultencoding()` is [not recommended](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/). – Mark Tolonen Sep 13 '16 at 01:38
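
The two symptoms discussed in these comments can be reproduced without Twitter. The sketch below is only an illustration: the sample string is made up, `io.StringIO` stands in for the output file, and `text.encode("ascii")` stands in for a file whose default encoding is ASCII. It suggests that encoding a string before handing it to `csv.writer` produces the `\x..` escapes, because the writer calls `str()` on the resulting bytes object, while an ASCII target raises the `UnicodeEncodeError` quoted above.

```python
import csv
import io

text = "Özgür çalışıyor"  # hypothetical sample with Turkish characters

# Encoding first hands csv a bytes object; csv.writer converts it with str(),
# so the output contains the bytes repr (b'...') with \x escapes.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([text.encode("utf-8")])
encoded_row = buf.getvalue()

# Passing the str unchanged keeps the characters intact.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([text])
plain_row = buf.getvalue()

# An ASCII-only target cannot represent the Turkish characters.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(repr(encoded_row))
print(repr(plain_row))
print(ascii_error)
```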

1 Answer

The csv module docs recommend that you specify the encoding when you open the file (and also that you use newline='' so the csv module can do its own newline handling). Don't encode Unicode strings when writing rows.

import csv

with open('test.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id','created_at','text'])
    writer.writerows([[123, 456, 'Äβç']])
roeland
  • Got it, thank you. Now, when I open the file I have to import it specifically as a utf-8 file when I open it in Excel. I am assuming I will figure out a way for this to be the default so I don't have to do it every time. In addition, when I do import the data as follows, for some reason, the columns I set in Python no longer hold (i.e. id, created_at, and text are all one column). This is the modified code: – bayrah Sep 12 '16 at 23:28
  • I have edited the code above. If anyone has any further advice, please do let me know (on setting the import environment and dealing with columns). I can't use comma as the delimiter because tweets have commas within them. – bayrah Sep 12 '16 at 23:36
  • @bayrah then have a look at the rest of the docs. The CSV import settings (delimiters etc.) have to match the way your script is writing the CSV file. – roeland Sep 12 '16 at 23:48
  • It opens fine when I directly open the CSV file, which is still in Unicode. The problem is when I import it and set the environment in Excel. Thank you for your help; the essential problem is solved. Worst case scenario, I can format afterward. – bayrah Sep 13 '16 at 01:07
  • @bayrah Use the `utf-8-sig` encoding for writing the file if you want to open it in Excel; otherwise, Excel assumes the file is in a localized encoding and not UTF-8. – Mark Tolonen Sep 13 '16 at 01:36
  • @bayrah the `csv` module has lots of arguments for formatting the output. Try using the tab character `\t` as a delimiter instead of comma. – Mark Ransom Sep 13 '16 at 03:14
  • utf-8-sig fixed everything, thanks. Turkish characters appeared and column formatting was not affected. – bayrah Sep 13 '16 at 04:06
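
To make the last few comments concrete, here is a minimal sketch of the two points raised: `utf-8-sig` prepends a byte order mark that Excel can use to detect UTF-8, and `csv.writer` quotes any field containing a comma, so commas inside tweets do not split columns. The file name and the sample row are invented for the illustration, and a temporary directory is used so nothing is left behind in the working directory.

```python
import csv
import os
import tempfile

# Hypothetical row; the text field deliberately contains a comma.
rows = [["1", "2016-09-12", "Merhaba, dünya; çay içelim"]]

path = os.path.join(tempfile.mkdtemp(), "demo_tweets.csv")

# utf-8-sig writes a BOM (EF BB BF) before the UTF-8 data.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "created_at", "text"])
    writer.writerows(rows)

# Inspect the raw bytes to confirm the BOM is present.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM; the comma stays inside one field.
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))

print(has_bom)
print(read_back)
```

If a tab-separated file is preferred instead, as suggested above, `csv.writer(f, delimiter="\t")` is the standard way to request it, and the same delimiter then has to be chosen when importing.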