
I'm being passed data that is ebcdic encoded. Something like:

s = u'@@@@@@@@@@@@@@@@@@@ÂÖÉâÅ@ÉÄ'

Attempting to .decode('cp500') is wrong, but what's the correct approach? If I copy the string into something like Notepad++, I can convert it from EBCDIC to ASCII, but I can't find a viable way to do the same in Python. For what it's worth, the correct result is BOISE ID (plus or minus space padding).

The information is being retrieved from a file of lines of JSON objects. That file looks like this:

{ "command": "flush-text", "text": "@@@@@O@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@O" }
{ "command": "flush-text", "text": "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\u00C9\u00C4@\u00D5\u00A4\u0094\u0082\u0085\u0099z@@@@@@@@@@\u00D9\u00F5\u00F9\u00F7\u00F6\u00F8\u00F7\u00F2\u00F4" }
{ "command": "flush-text", "text": "@@@@@OmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmO" }
{ "command": "flush-text", "text": "@@@@@O@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@O" }

And the processing loop looks something like:

import json

with open('myfile.txt', 'rb') as fh:
  for line in fh:
    data = json.loads(line)
g.d.d.c
  • Encode or decode? Why would you decode a Unicode string? Wouldn't you want to decode a byte string? – 200_success Jan 31 '16 at 08:42
  • Agreed. But ... what I'm getting is already in a unicode string - it came from `json.loads`. It's kind of a cluster, tbh. But I'm struggling with how to get from what I have to work with to what I need. – g.d.d.c Jan 31 '16 at 08:45
  • It sounds like you aren't really clear what you have to work with. – holdenweb Jan 31 '16 at 08:57
  • Are you using Python 2 or 3? – cdarke Jan 31 '16 at 08:59
  • Can you add a snippet showing how you get hold of this data in your Python script? If it comes from a file, then maybe you should open it in the right encoding.... – flaschbier Jan 31 '16 at 09:00
  • @cdarke - 2. I'd considered trying to use 3, but I'm not sure the libraries I'm using are all compatible. – g.d.d.c Jan 31 '16 at 09:00

2 Answers


If Notepad++ converts it ok, then you should simply need:

Python 2.7:

import io
import json

with io.open('myfile.txt', 'r', encoding="cp500") as fh:
  for line in fh:
    data = json.loads(line)

Python 3.x:

import json

with open('myfile.txt', 'r', encoding="cp500") as fh:
  for line in fh:
    data = json.loads(line)

This uses a TextIOWrapper to decode the file with the given encoding as it's read. On Python 2.x, the io module provides the Python 3 style open(), with codecs/TextIOWrapper and universal newline support.
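For illustration, here's a self-contained sketch of that approach. The file path and sample line are made up, and it assumes the whole file, JSON syntax included, is cp500-encoded:

```python
import io
import json
import os
import tempfile

# Hypothetical sample: one JSON line encoded entirely in EBCDIC (cp500)
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
line = u'{ "command": "flush-text", "text": "BOISE ID" }\n'
with io.open(path, 'wb') as fh:
    fh.write(line.encode('cp500'))

# io.open decodes each line from cp500 before json.loads ever sees it
with io.open(path, 'r', encoding='cp500') as fh:
    for raw in fh:
        data = json.loads(raw)
        print(data['text'])  # -> BOISE ID
```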

Alastair McCormack

My guess is that you need the corresponding Unicode ordinals as raw byte values, which you can then decode with cp500.

>>> s = u'@@@@@@@@@@@@@@@@@@@ÂÖÉâÅ@ÉÄ'
>>> bytearray(ord(c) for c in s).decode('cp500')
u'                   BOISE ID'

Alternatively:

>>> s.encode('latin-1').decode('cp500')
u'                   BOISE ID'
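The reason the latin-1 round trip works: Latin-1 maps code points 0–255 one-to-one onto byte values 0–255, so encoding the mis-decoded string with it recovers the original EBCDIC bytes exactly. A short sketch (padding shortened for readability):

```python
# The garbled string holds EBCDIC byte values disguised as Latin-1 code points.
s = u'@@@\u00c2\u00d6\u00c9\u00e2\u00c5@\u00c9\u00c4'

# latin-1 encodes each code point < 256 to the identical byte value,
# restoring the original EBCDIC bytes...
raw = s.encode('latin-1')
assert raw == b'\x40\x40\x40\xc2\xd6\xc9\xe2\xc5\x40\xc9\xc4'

# ...which cp500 then decodes as real EBCDIC text.
print(raw.decode('cp500'))  # -> '   BOISE ID'
```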
timgeb
  • This answer works correctly, but I'm interested in testing the `io` module further so I went with that approach. – g.d.d.c Jan 31 '16 at 19:31
  • @g.d.d.c the other answer is the correct approach. I assumed you got your unicode string from *somewhere* and had to deal with what you got. – timgeb Jan 31 '16 at 19:59