
I am trying the following:

>>> import string
>>> s = 'https://google.com\n<0x03><0x03><0x03>'
>>> s.decode('utf8').encode('ascii', errors='ignore')

The expected output is:

'https://google.com'

But the hex characters and the newline are not removed.

Tonechas
Sampaul
  • does this answer your query: https://stackoverflow.com/questions/36598136/remove-all-hex-characters-from-string-in-python? – Krishna Chaurasia Jan 28 '21 at 12:37
  • There are no non-ascii characters in your original input `'https://google.com\n<0x03><0x03><0x03>'` Edit: to clarify `\n` is valid ascii, `<0x03>` are just a series of six ascii characters and aren't raw bytes, also `\x03` is valid ascii – lvrf Jan 28 '21 at 16:56
  • Why do you expect it to remove `\n` or the other characters? ASCII characters have codes 0 to 127, so `\x03` is an ASCII character too. If you don't want the text after `\n`, use `s = s.split('\n')[0]` – furas Jan 28 '21 at 17:03
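
The points raised in the comments can be checked directly. This is a minimal sketch, assuming Python 3 (where `str` has no `.decode`, so the round trip starts with `.encode`) and assuming that `<0x03>` in the question is the literal six-character text rather than the control character `\x03`:

```python
s = 'https://google.com\n<0x03><0x03><0x03>'

# Every character is already ASCII (code points 0-127),
# so errors='ignore' has nothing to drop.
all_ascii = all(ord(ch) < 128 for ch in s)
round_trip = s.encode('ascii', errors='ignore').decode('ascii')

# Keep only the text before the first newline, as suggested in the comments.
first_line = s.split('\n')[0]

print(all_ascii)
print(round_trip == s)
print(first_line)
```

The round trip returns the string unchanged, which is why the original attempt appears to do nothing.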

1 Answer


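This code:

import re

s = 'https://google.com\n<0x03><0x03><0x03>'
# [^ -~] matches the first character outside the printable ASCII range
# (space through tilde), here the newline; .* then consumes the rest of that line.
s = re.sub(r'[^ -~].*', '', s)
print(s)

prints:

https://google.com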
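
Note that this pattern deletes everything from the first non-printable character onward. If the goal is to strip only the unwanted pieces and keep any later text, something like the following may be closer. It is a sketch rather than a definitive solution: the helper name `strip_noise` is hypothetical, and the `<0xNN>` token format is an assumption based on the sample string in the question.

```python
import re

# Assumed token format, taken from the sample string: "<0x" + two hex digits + ">"
HEX_TOKEN = re.compile(r'<0x[0-9A-Fa-f]{2}>')
# Anything outside the printable ASCII range (space through tilde)
NON_PRINTABLE = re.compile(r'[^ -~]')

def strip_noise(text):
    """Remove literal <0xNN> tokens, then any non-printable characters."""
    text = HEX_TOKEN.sub('', text)       # drop literal tokens such as "<0x03>"
    return NON_PRINTABLE.sub('', text)   # drop newlines, control chars, non-ASCII

print(strip_noise('https://google.com\n<0x03><0x03><0x03>'))
print(strip_noise('https://google.com\x03\x03/search'))
```

Because newlines are also removed, multi-line input is joined into a single line; split on `\n` first if only the first line is wanted.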
Younes
  • You can find an explanation of the ` -~` character range used in the regex (space through tilde, i.e. the printable ASCII characters) here: [my favorite regex](https://catonmat.net/my-favorite-regex) – Younes Jan 28 '21 at 12:53