Unable to encode or decode the string properly

Question

I have looked at a bunch of Stack Overflow examples.

Python version used: Python 2.7.10

The string s looks like this:

u'bh\xfcghi' where \xfc=ü

I am reading this from a webpage.

After I encode the string via .encode('utf-8'), it looks like

'bh\xc3\xbcghi' where \xc3\xbc=ü

The expected output is:

bhüghi

I even tried decoding and encoding with latin-1, and decoding with utf-8.

After nfn neil's comment, I tried the following again:

elem.text output:

('elem text:', u'bh\xfcghi\nMCI\n8 90 1 0 0 2 0 0 0 0 0 0 2 26 41.4 18.5 89 14.9')

elem text type:

('elem text type:', <type 'unicode'>)

Now, I am trying to print it:

splitString = elem.text.encode('utf-8').decode("utf-8").split()
print("splitString: ", splitString[0])

splitString[0] output:

u'bh\xfcghi'

Now if I print the whole string after split:

print("splitString: ", splitString)

splitString output:

[u'bh\xfcghi', u'MCI', u'8', u'90', u'1', u'0', u'0', u'2', u'0', u'0', u'0', u'0', u'0', u'0', u'2', u'26', u'41.4', u'18.5', u'89', u'14.9']

The full code is on Pastebin: https://pastebin.com/E3PNmCbW

Any help will be appreciated.

The issue is that something is preventing the string from being modified. It's not an encoding issue. – Neil Apr 07 '17 at 00:55
[Pastebin link for the full code](https://pastebin.com/E3PNmCbW) – Jazzy Apr 07 '17 at 00:57
I got it working: `splitString = unicodedata.normalize('NFKD', elem.text).encode('ascii','ignore').split()` – Jazzy Apr 07 '17 at 01:43

score 0 · Answer 1 · edited Apr 07 '17 at 00:57

s = u'bh\xfcghi\nMCI\n8 90 1 0 0 2 0 0 0 0 0 0 2 26 41.4 18.5 89 14.9'
s = s.encode('utf-8')
xs = s.split(' ')
print(xs[0])

Output:

bhüghi
MCI
8

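Try it; it works. The reason you don't get your 'expected' output when you just type the expression at an interactive prompt is that the interpreter shows the repr of the value, which uses \x escape codes for non-ASCII characters. print writes the characters themselves.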
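The tuple-style output shown in the question can be explained with a minimal sketch. It assumes, as the `('elem text:', ...)` output suggests, that the code runs on Python 2.7 without `from __future__ import print_function`, in which case `print("label", value)` prints a tuple and tuple items are displayed with repr(). The names `text`, `first`, `as_tuple`, and `line` are illustrative and not taken from the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # print becomes a function on Python 2.7 as well

# Shortened stand-in for elem.text from the question (hypothetical sample).
text = u'bh\xfcghi\nMCI\n8 90 1'
first = text.split()[0]

# Without the __future__ import, Python 2.7 treats print("a", b) as printing
# the tuple ("a", b). Tuple items are displayed with repr(), which is where
# the escaped u'...' form in the question comes from.
as_tuple = ("splitString: ", first)

# Building one unicode string (or using the print function) writes the
# character itself instead of its escape code.
line = u"splitString: " + first
print(line)
```

If stdout has no encoding on Python 2.7 (for example when the output is piped), printing a unicode string can raise UnicodeEncodeError. In that case, printing `line.encode('utf-8')` is one option.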

answered Apr 07 '17 at 00:47 by xrisk · edited Apr 07 '17 at 00:57 by zondo

This won't work; please refer to the Pastebin link for the full code. – Jazzy Apr 07 '17 at 01:09

score 0 · Accepted Answer · answered Apr 09 '17 at 11:43

I got it working by using the unicodedata library:

import unicodedata
splitString = unicodedata.normalize('NFKD', elem.text).encode('ascii', 'ignore').split()
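One thing worth checking with this approach: NFKD normalization splits ü into a plain u plus a combining diaeresis (U+0308), and encoding to ASCII with 'ignore' then drops the combining mark. The sketch below uses a shortened, hypothetical sample in place of elem.text; the variable names are illustrative only.

```python
import unicodedata

# Shortened stand-in for elem.text from the question (hypothetical sample).
text = u'bh\xfcghi\nMCI\n8 90 1'

# NFKD decomposes u-umlaut into 'u' followed by the combining mark U+0308.
decomposed = unicodedata.normalize('NFKD', text)

# Encoding to ASCII with 'ignore' silently drops the combining mark.
ascii_parts = decomposed.encode('ascii', 'ignore').split()

# For comparison: splitting the unicode string and encoding each part as
# UTF-8 keeps the umlaut as the two bytes \xc3\xbc.
utf8_parts = [part.encode('utf-8') for part in text.split()]
```

With this sample the first field becomes bhughi, not the bhüghi given as the expected output, so this route only fits when losing the accent is acceptable.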

answered Apr 09 '17 at 11:43 by Jazzy
