
This is my first time trying out Python in a very long time. I am simply trying to extract tweets and print them to the console using Twython.

tw = Twython(APP_KEY, access_token=access_token)
search = tw.search(q='#python')
for tweet in search["statuses"]:
    print(tweet['user']['name'])
    print(tweet['text'])

Usually a few tweets will print and then I run into this while printing either the user name or the tweet text (it varies depending on where the character occurs):

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 139: character maps to <undefined>

I have tried adding .encode('utf-8') or wrapping it in str(), but the closest I get is b'text here', and obviously I just want the tweet text. I even tried tacking on decode(). I read that I have to tell Python which charset I want to encode to, which I have been doing, but I still get b'string here'. A lot of the examples I find on the web are also not for Python 3, which makes it a little more difficult to find what I need.

Can someone point me in the right direction?

  • send the encoded bytes to stdout? Is that the only way? I was able to print it all, just need to get newlines in there now – user2744955 Sep 04 '13 at 02:03

2 Answers


What character set does your console use? I assume it's ASCII or a similar legacy code page (the 'charmap' codec named in the error usually points to the latter). '\u2026' is a legal character in UTF-8 but cannot be represented in ASCII.
When you print a string, Python tries to encode it with your console's default character set, since a string is stored internally as a sequence of Unicode code points. The error you encountered occurs when some character in the string is not supported by that default character set.
You can change your locale to a UTF-8 one and run your script again. By the way, the Unicode character '\u2026' is displayed as "…".
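
If changing the console encoding is not an option, one workaround is to replace the characters the console cannot represent before printing. The following is a minimal sketch, assuming Python 3; `console_safe` is a hypothetical helper name and the sample text is made up.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to the console's encoding, then to ASCII if it is unknown.
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    return text.encode(target, errors="replace").decode(target)


if __name__ == "__main__":
    tweet_text = "Learning #python\u2026"  # made-up sample text
    print(console_safe(tweet_text))
```

On Python 3.7 or later, calling `sys.stdout.reconfigure(encoding="utf-8")` or setting the `PYTHONIOENCODING=utf-8` environment variable are alternatives, provided the console itself can display UTF-8.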

hago

You might find this page on how the Twitter API counts characters in UTF-8 text useful. It explains why some UTF-8 characters will work at the end of a tweet and others won't:

https://dev.twitter.com/docs/counting-characters
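
To illustrate why counting can be surprising, here is a small sketch that compares the raw length of a string with its length after NFC normalization. It assumes the rule described on that page is NFC-based counting, so check the page before relying on it; `nfc_length` is a hypothetical helper name.

```python
import unicodedata


def nfc_length(text):
    """Count code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


decomposed = "cafe\u0301"      # 'e' followed by a combining acute accent
print(len(decomposed))         # 5 code points as typed
print(nfc_length(decomposed))  # 4 once 'e' and the accent are composed
```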

As for your actual question, insert the following client_args definition into your code:

from twython import Twython

APP_KEY = "key"
APP_SECRET = "key-secret"
OAUTH_TOKEN = "token"
OAUTH_TOKEN_SECRET = "secret"

client_args = {
  "headers": {
    "accept-charset": "utf-8"
  }
}

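# Pass client_args so the custom headers are actually sent with each request.
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET, client_args=client_args)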

That sends an Accept-Charset header, which should tell the Twitter API that your application accepts UTF-8. Then you need to make sure that your script and every interface it passes through (console, shell, editor) also use UTF-8. After that, you can simply type the character(s) you're after in the tweet or DM and send it.
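
To check which encodings Python has picked up from your environment, something like the following sketch may help. It uses only the standard library; `encoding_report` is a hypothetical helper name, not part of Twython.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python will use for console output and the locale."""
    return {
        # May be None when stdout is redirected to an unusual stream.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

If the "stdout" entry is not a UTF-8 variant, printing characters such as '\u2026' can fail regardless of what headers are sent to Twitter.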

If the above client_args setting doesn't do it in conjunction with specifying your character set in shells and other programs, it might require playing around with the specific headers being transmitted. You may, for example, find that "content-type" is a better header to set, or that you need to include it as well (although that shouldn't be necessary).

Most of my tweets are sent through Emacs (either Twittering Mode or a shell calling a Twython script within an Emacs buffer) and there is no trouble sending a whole range of UTF-8 characters, up to Unicode 5.1 or 5.2, I think.

I haven't actually needed to set the custom headers with my scripts, but that's because UTF-8 is my default character set for all of the following: Emacs, bash (shells), Firefox, Thunderbird, GPG (the last doesn't affect Twitter, but it's always worth encouraging the use of) and finally the Twitter API itself. If I had not already set all those other things to use UTF-8 by default then I'd almost certainly run into trouble with Unicode through shell scripts and possibly elsewhere too.

Finally, if you find that most UTF-8 characters can be sent through your script, but some (usually less common or relatively new) characters cannot, then the likely cause is the version of Unicode supported by your operating system and/or the available character sets (fonts). If you run into this issue, you're going to have real trouble, because even if you manage to transmit the right character to Twitter, your computer won't be able to display it. On the other hand, if you reach that point you will at least see some of your tweet and the error messages will stop.

The Python Requests documentation and the Twython documentation provide additional detail on the format for sending (POSTing) customised headers and Wikipedia includes a list of header types.

The Wikipedia list is here:

https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

Unfortunately my stack account is only recently activated, so I can't link all of the useful stuff. You may need to check the Requests documentation (find "More complicated POST requests" section) and the Twython documentation (find "Manipulate the request headers, proxies, etc." section).

Ben