
Let's say I have a unicode variable:

uni_var = u'Na teatr w pi\xc4\x85tek'

I want a string that is the same as uni_var, just without the "u" prefix:

str_var = 'Na teatr w pi\xc4\x85tek'

How can I do it? I would like to find something like:

str_var = uni_var.text()
Martijn Pieters
pbialy
  • Why are you using UTF-8 bytes in a Unicode string? Shouldn't that be `u'Na teatr w pi\u0105tek'`? – Martijn Pieters Feb 13 '15 at 10:51
  • Bytes are **encoded text**. So if you have a Unicode string, you need to *encode* it to bytes. If you mean you have codepoints in your Unicode string that are really UTF-8 bytes, you can unbreak that by encoding to Latin-1 and then decoding again as UTF-8. – Martijn Pieters Feb 13 '15 at 10:52
  • @Martijn Well, I would be happy if it were like this, but `u'Na teatr w pi\xc4\x85tek'` is what I'm getting from a source which I can't change. – pbialy Feb 13 '15 at 10:54
  • I don't understand where my question is unclear. But your solution works, thanks for the help ;) – pbialy Feb 13 '15 at 11:02
  • Well, the solution works for my case, but in general it's not the answer. I guess there's no method like "get_text()" for unicode then? – pbialy Feb 13 '15 at 11:06
  • A bytestring is not text, no. Unicode is actual text; `str` is just a sequence of 0-255 integers that most people use to represent text, forgetting that it's still encoded. So `65` is shown as `A` by Python, and that's great, but never forget that that's because something somewhere accepted the `65` byte and drew an `A` shape on the screen. :-) `unicode` values are a better model of the concept, but in the end, if you print Unicode, encoding takes place so something can draw those lines. – Martijn Pieters Feb 13 '15 at 11:46
  • But just to be clear, there are a lot of misunderstandings about Unicode and byte encodings, and there were any number of different ways your *can I get text* question could be interpreted depending on what misconception was being applied. The number of people that have manually typed in what they *thought* was in the `unicode` object rather than give us the actual value is huge, for example. And your question is missing context as to how you got that value, for example. – Martijn Pieters Feb 13 '15 at 11:52

2 Answers


You appear to have badly decoded Unicode; those are UTF-8 bytes masquerading as Latin-1 codepoints.

You can get back to proper UTF-8 bytes by encoding with a codec that maps Unicode codepoints one-to-one to bytes, like Latin-1:

>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('latin1')
'Na teatr w pi\xc4\x85tek'

Be careful, though: it could also be that the CP1252 encoding was used to decode to Unicode here. It all depends on where this Mojibake was produced.
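To illustrate the Latin-1 versus CP1252 distinction, here is a minimal sketch written for Python 3 (where `str` is Unicode text and `bytes` is the byte string). The helper name `repair_mojibake` is hypothetical, not part of any library. It tries Latin-1 first and falls back to CP1252, and it assumes the original bytes were UTF-8:

```python
def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were wrongly decoded as Latin-1 or CP1252.

    Hypothetical helper for illustration; assumes the original data was UTF-8.
    Returns the input unchanged if neither round trip succeeds.
    """
    for codec in ("latin-1", "cp1252"):
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# The value from the question: UTF-8 bytes read as Latin-1 codepoints.
broken = "Na teatr w pi\xc4\x85tek"
fixed = repair_mojibake(broken)
print(fixed)
```

With a CP1252 decode the byte 0x85 would have become U+2026 (an ellipsis) rather than U+0085, which Latin-1 cannot encode, so the fallback codec is the one that recovers the bytes in that case.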

You could also use the ftfy library to detect how to best repair this; it produces Unicode output:

>>> import ftfy
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> ftfy.fix_text(uni_var)
u'Na teatr w pi\u0105tek'
>>> print ftfy.fix_text(uni_var)
Na teatr w piątek

The library will handle CP1252 Mojibake automatically.

Martijn Pieters

You need to encode your string to Latin-1:

>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('Latin-1')
'Na teatr w pi\xc4\x85tek'
styvane