5

I wrote a program to read a Windows DNS debugging log, but I always get some funny characters in the domain field.

Below is one example:

(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)

I want to replace every \x.. sequence with a ?

Explicitly typing \xc2, as follows, works:

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it does not work if I write it as follows:

re.sub('\\\x..', '?', line)

How can I write a regular expression to replace them all?

wim
kenneth171
  • Purely as an exercise for the reader: `re.sub('[\x80-\xff]', '?', line)`. But please don't do that, @wim's answer is what you should go for. – Andrew Gelnar Sep 28 '16 at 16:38

2 Answers

3

There are better tools for this job than regex, you could try for example:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

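That skips non-ASCII characters. Or with 'replace', you can swap them for a placeholder instead (decoding inserts U+FFFD, the Unicode replacement character; encoding the result back to ASCII with 'replace' turns each one into a literal '?'):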

>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
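The sessions above are Python 2, where `line` is a byte string. A rough Python 3 equivalent, assuming the log line is available as `bytes` (for instance from a file opened in binary mode), might look like the sketch below. The helper name `to_ascii` is made up for illustration and is not part of any library.

```python
def to_ascii(raw, placeholder="?"):
    """Decode raw bytes as ASCII, swapping each non-ASCII byte for a placeholder."""
    # errors="replace" yields one U+FFFD per undecodable byte
    text = raw.decode("ascii", errors="replace")
    # Turn the replacement character into the placeholder we actually want
    return text.replace("\ufffd", placeholder)


line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

print(to_ascii(line))                          # (13)????????p????(5)example(3)com(0)
print(line.decode("ascii", errors="ignore"))   # (13)p(5)example(3)com(0)
```

Each non-ASCII byte becomes its own placeholder, so the 13-byte label keeps its length.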

But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages.
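As a sketch of what recovering the data could involve: the pattern of `\xc2`/`\xc3` lead bytes is typical of text that was decoded as Latin-1 and then re-encoded as UTF-8. Assuming that is what happened here (a guess, nothing in the log confirms it), one layer can be undone in Python 3 as follows. The code page of the resulting bytes is still unknown and would have to be determined separately.

```python
line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# The 13 bytes that follow the "(13)" length prefix
label = line[4:17]

# Assumption: the label is UTF-8 that was produced from Latin-1-decoded text.
# Reverse that single step: UTF-8 bytes -> text -> original single bytes.
text = label.decode("utf-8")
original = text.encode("latin-1")

print(len(label), len(original))  # 13 7
print(original)                   # b'\xb5\xb1\xbe\xe2p\xf4\x8d'
```

If either step raises `UnicodeDecodeError` or `UnicodeEncodeError` on other log lines, the assumption does not hold for them.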

There is an excellent answer about unbaking mojibake here. Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.

Community
wim
-2

What about this?

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

pattern = r'\\x.+'
re.sub(pattern, r'?', line)
kmario23
  • This is wholly incorrect. The string is not a series of `\` and `x` characters followed by a pair of alphanumeric characters; each `\xNN` is a representation of a single byte outside the ASCII range. The `__repr__` of a Python string (ambiguous term) catches these bytes and prints a representation of their hex value. – Andrew Gelnar Sep 28 '16 at 16:20
  • No, this does not work, because \xc2 is not treated as ordinary text; the whole 'string' cannot be treated as a combination of individual characters. – kenneth171 Oct 03 '16 at 00:49
  • No, this does not work, because \xc2 is not treated as ordinary text; the whole 'string' cannot be treated as a combination of individual characters. I found that I can use ranges of values in a regex for this: re.sub(r'[\x03]|[\x8d]|[\xa0-\xaf]|[\xb0-\xbf]|[\xc0-\xcf]|[\xd0-\xdf]', '', line). The downside is that I need to know the possible ranges in advance in order to build the pattern. – kenneth171 Oct 03 '16 at 00:56
  • @kenneth171 Check my comment on the question. ASCII only uses `\x00-\x7f`, so you can use the range `\x80-\xff`. – Andrew Gelnar Oct 04 '16 at 12:31
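
The points made in these comments can be illustrated with a short Python 3 sketch, assuming the line is held as `bytes`. It shows that each `\xNN` in the repr is a single byte, and that one byte-range character class covers everything outside ASCII without listing ranges by hand.

```python
import re

line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# '\xc2' is how the repr displays ONE byte, not a backslash, an 'x' and two digits
print(len(b'\xc2'))   # 1
print(b'\\' in line)  # False: the data contains no literal backslash

# Any byte from 0x80 to 0xff is outside the ASCII range
non_ascii = re.compile(rb'[\x80-\xff]')
print(non_ascii.sub(b'?', line))  # b'(13)????????p????(5)example(3)com(0)'
```

To also catch control characters such as `\x03`, a negated class like `rb'[^\x20-\x7e]'` (anything that is not printable ASCII) is one option.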