I have an `lxml.objectify.StringElement` named `elm` whose value is:

u'\u266b\u266b\u266b\u266b\u266b\u266b\u266bHow do you get a job on the Yahoo staff when you are older?\u266b\u266b\u266b\u266b\u266b?'

I want to convert it to a `str`:

str(elm)

But I get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-6: ordinal not in range(128)  
Martijn Pieters
someone
  • Python 2 or 3? Unicode and strings are handled very differently between the two versions. – aquavitae May 23 '14 at 13:55
  • And what do you expect the string output to be? Note that you *already* have Unicode text there. Why do you need a byte string? What encodings can you use for the bytes? – Martijn Pieters May 23 '14 at 13:59
  • I suggest you read this: http://www.joelonsoftware.com/articles/Unicode.html – aquavitae May 23 '14 at 14:01
  • This is the relevant part of my code: `string_of_words = str(elm)`, then `list_of_words = string_of_words.split(' ')`, then `array_of_words = np.array(list_of_words)`. I need to do some operation on each word in `elm`, so I thought this way I could extract each word. Is there a better way? – someone May 23 '14 at 14:02
  • How much do you understand about Unicode vs. encoded text already? Perhaps you should first make sure you understand what the difference is. See the [Python Unicode HOWTO](https://docs.python.org/2/howto/unicode.html), the Joel on Software article aquavitae pointed you to, as well as [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html). – Martijn Pieters May 23 '14 at 14:06
  • `unicode(elm).encode('utf-8')` should work... But if you deal with unicode, you should really directly work with it instead of needing to convert it to a `str`. – mata May 23 '14 at 14:09
  • using `utf-8` I got this `ΓÖ½ΓÖ½ΓÖ½ΓÖ½ΓÖ½ΓÖ½ΓÖ½How do you get a job on the Yahoo staff when you are older?ΓÖ½ΓÖ½ΓÖ½ΓÖ½ΓÖ½?` I guess this is not what you are looking for. – ρss May 23 '14 at 14:12
  • It gave me: '\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xabHow do you get a job on the Yahoo staff when you are older?\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xab?' but I need to know what is in elm :( – someone May 23 '14 at 14:13
  • That's the utf-8 encoded `str` version of your `elem`. What else do you need? – mata May 23 '14 at 14:20
  • I am reading a sentence from a file and trying to see what words it contains, and filter some of them, but my program crashes at this point. – someone May 23 '14 at 14:25
  • this is the sentence that I am reading from xml file: ♫♫♫♫♫♫♫How do you get a job on the Yahoo staff when you are older?♫♫♫♫♫? – someone May 23 '14 at 14:28
  • is there any way that I can make the program ignore the error and continue to the other sentences? – someone May 23 '14 at 14:29
  • That's a very basic Python question. The answer is `try`/`except`. But there obviously is no workaround for the fact that basic byte strings can only accommodate a subset of the full Unicode repertoire. How do you plan to cope if your input contains *voilá* or *Ångström*? – tripleee May 23 '14 at 18:07
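
A rough sketch of the `try`/`except` approach from the last comment, for anyone following along: the loop skips any sentence that cannot be encoded as ASCII and carries on with the rest. It also shows two alternatives, the `'ignore'` error handler (lossy) and a UTF-8 encode (lossless). The `sentences` list and the other variable names are hypothetical stand-ins for text read from the XML file, and the `__future__` import is only there so the snippet behaves the same on Python 2 and 3.

```python
from __future__ import print_function, unicode_literals

# Hypothetical stand-ins for sentences read from the XML file.
sentences = [
    'plain ASCII sentence',
    '\u266b\u266bHow do you get a job?\u266b',
]

encoded = []   # sentences that fit into ASCII byte strings
skipped = []   # sentences that raised UnicodeEncodeError

for sentence in sentences:
    try:
        encoded.append(sentence.encode('ascii'))
    except UnicodeEncodeError:
        # Remember the sentence and continue with the next one.
        skipped.append(sentence)

# Alternative 1: silently drop characters that ASCII cannot represent.
lossy = sentences[1].encode('ascii', 'ignore')

# Alternative 2: encode to UTF-8, which can represent every character
# and decodes back to the original text.
utf8 = sentences[1].encode('utf-8')

print(len(encoded), len(skipped))
```

Skipping or dropping characters loses data, which is the concern raised above about input such as *Ångström*; working with the Unicode text directly avoids the problem altogether.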

2 Answers

I've run into a similar situation and something like this worked for me (I can't find the code now):

a=u'\u266b\u266b\u266b\u266b\u266b\u266b\u266bHow do you get a job on the Yahoo staff when you are older?\u266b\u266b\u266b\u266b\u266b?'
print bytes(a.encode('utf-32'))

But I get this with your string:

��k&k&k&k&k&k&k&How do you get a job on the Yahoo staff when you are older?k&k&k&k&k&?

Hah! I know this may not help you, but maybe it will be a step in the right direction. By the way, you might want to try Python 3+; it handles Unicode much better.
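
For what it's worth, the `k&` pairs in that output are probably just the raw UTF-32 bytes of the note character being shown as ASCII. A small sketch to illustrate (variable names are made up for the example; the `__future__` import only keeps it portable between Python 2 and 3):

```python
from __future__ import print_function, unicode_literals
import codecs

note = '\u266b'  # BEAMED EIGHTH NOTES, code point U+266B

# Little-endian UTF-32 stores U+266B as the bytes 6B 26 00 00;
# 0x6B is ASCII 'k' and 0x26 is ASCII '&'.
le_bytes = note.encode('utf-32-le')
print(repr(le_bytes))

# The plain 'utf-32' codec also prepends a 4-byte byte order mark (BOM),
# which a terminal typically cannot display.
with_bom = note.encode('utf-32')
has_bom = with_bom.startswith(codecs.BOM_UTF32)

# Decoding with the same codec gets the original text back,
# so no information is lost; the bytes just are not printable text.
round_trip = with_bom.decode('utf-32')
```

So the encoded bytes are valid, but printing them directly only makes sense on a terminal that expects that encoding.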

peterjwest
notorious.no

You don't need any conversion: the file content is already Unicode by default. Just remove the `str` call. All the string methods also work on `unicode` objects, so splitting will be fine. If you need Unicode text out of some other object, use `unicode` instead of `str`.
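
A minimal sketch of what that could look like, assuming the element's text has already been obtained as Unicode (for example via `unicode(elm)` on Python 2, as suggested in the comments on the question). Here `text` is a plain stand-in for that value rather than a real lxml element, and `words` and `cleaned` are illustrative names:

```python
from __future__ import print_function, unicode_literals

# Stand-in for the Unicode text taken from the element.
text = ('\u266b\u266b\u266b\u266b\u266b\u266b\u266b'
        'How do you get a job on the Yahoo staff when you are older?'
        '\u266b\u266b\u266b\u266b\u266b?')

# Splitting works on Unicode text exactly as it does on byte strings.
words = text.split(' ')

# Optionally strip the note symbols and question marks from each word.
cleaned = [word.strip('\u266b?') for word in words]

print(len(words))
```

Every item in `words` is still Unicode, so no `UnicodeEncodeError` can occur until something tries to encode the text.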

Ivan Klass
  • thanks for your reply. but the elm is a lxml.objectify.StringElement and it does not support splitting :( – someone May 30 '14 at 01:39
  • @someone, have you tried using `unicode` instead of `str`? Also, there is probably a field holding the Unicode text inside this object. Have you inspected that? – Ivan Klass May 30 '14 at 01:43