I have looked at a bunch of Stack Overflow examples.

Python version used: Python 2.7.10

The string s looks like this:

u'bh\xfcghi' where \xfc=ü

I am reading this from a webpage.

After I encode the string via .encode('utf-8'), it looks like

'bh\xc3\xbcghi' where \xc3\xbc=ü

Expected output:

bhüghi

I even tried decoding/encoding with latin-1 and decoding with utf-8.

After nfn neil's comment, I tried the following again:

elem.text output:

('elem text:', u'bh\xfcghi\nMCI\n8 90 1 0 0 2 0 0 0 0 0 0 2 26 41.4 18.5 89 14.9')

elem text type:

('elem text type:', <type 'unicode'>)

Now, I am trying to print it:

splitString = elem.text.encode('utf-8').decode("utf-8").split()
print("splitString: ", splitString[0])

splitString[0] output:

u'bh\xfcghi'

Now if I print the whole string after split:

print("splitString: ", splitString)

splitString output:

[u'bh\xfcghi', u'MCI', u'8', u'90', u'1', u'0', u'0', u'2', u'0', u'0', u'0', u'0', u'0', u'0', u'2', u'26', u'41.4', u'18.5', u'89', u'14.9']

The full code is on Pastebin (linked in the comments).

Any help will be appreciated.

Jazzy
  • The issue is there is something happening that's making it not modify a string. It's not an encoding issue. – Neil Apr 07 '17 at 00:55
  • [Pastebin link for the fullcode](https://pastebin.com/E3PNmCbW) – Jazzy Apr 07 '17 at 00:57
  • I got it working, ` splitString = unicodedata.normalize('NFKD', elem.text).encode('ascii','ignore').split()` – Jazzy Apr 07 '17 at 01:43

2 Answers

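# Python 2.7
s = u'bh\xfcghi\nMCI\n8 90 1 0 0 2 0 0 0 0 0 0 2 26 41.4 18.5 89 14.9'
s = s.encode('utf-8')  # UTF-8 byte string, printable on a UTF-8 terminal
xs = s.split(' ')      # splits on spaces only, so the newlines stay inside xs[0]
print(xs[0])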

Output:

bhüghi
MCI
8

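Try it; it works. The reason you don't get your 'expected' output when you just type the expression at the interactive prompt is that Python then shows the string's repr, which uses \x escape codes; print shows the text itself. The same applies to lists and tuples: their elements are always displayed with repr, and in Python 2 print("label", value) prints a tuple.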
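
A minimal sketch of the difference between the escaped form and the printed text, written so that it should run under both Python 2.7 and Python 3. The variable names are illustrative and not taken from the question's code, and the `unicode_escape` codec is used here only to reproduce the escaped form portably (Python 3's repr no longer escapes ü):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # makes print a function on Python 2 as well

s = u'bh\xfcghi\nMCI\n8 90 1 0 0 2'

# split() with no argument splits on any whitespace, newlines included.
parts = s.split()
first = parts[0]

# The escaped form, similar to what Python 2's repr shows (minus the u'' quoting).
escaped = first.encode('unicode_escape').decode('ascii')
print(escaped)  # ASCII-only text with a literal backslash: bh\xfcghi

# Printing the element itself, rather than the list or a tuple containing it,
# shows the real character, provided the terminal encoding supports it.
print(first)
```

The string holds the same six characters in both cases; only the way it is displayed differs.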

zondo
xrisk

I got it working by using the unicodedata library:

import unicodedata
splitString = unicodedata.normalize('NFKD', elem.text).encode('ascii', 'ignore').split()
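
It is worth noting what this does to the text: NFKD decomposes ü into a plain u followed by a combining diaeresis (U+0308), and encoding to ASCII with 'ignore' then drops the combining mark, so the first token comes out as bhughi rather than bhüghi. A small sketch of that behaviour, using a literal as a stand-in for elem.text (an assumed value for illustration) and intended to run under both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

text = u'bh\xfcghi\nMCI\n8 90'  # stand-in for elem.text (assumed value)

# NFKD splits U+00FC into 'u' followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize('NFKD', text)

# 'ignore' silently drops every character that has no ASCII encoding,
# which here removes the combining diaeresis.
ascii_bytes = decomposed.encode('ascii', 'ignore')

tokens = ascii_bytes.split()
print(tokens[0].decode('ascii'))  # bhughi
```

If the umlaut needs to be preserved, splitting the unicode string directly and printing a single element avoids the lossy conversion.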
Jazzy