
Using Python, given string = "Tiësto & Sevenn - BOOM (Artelax Remix)", which contains non-ASCII characters, how do I use unidecode to convert the string so that it contains only ASCII characters?

import random

string = random.choice(list(open('data.csv'))).rstrip()
print "[+] Starting search for:", string

artistname = string.rsplit(' - ', 1)[0]
songname = string.rsplit(' - ', 1)[1]

The snippet above gives me: artistname = Tiësto & Sevenn and songname = BOOM (Artelax Remix)

As you can see, the artistname still contains non-ascii characters. How do I use unidecode to fix this issue?

god
  • Did you read the [usage examples](https://pypi.python.org/pypi/Unidecode)? Did you make *any* attempt to figure out how to use unidecode? – user2357112 Sep 01 '17 at 20:21
  • What have you tried so far? Are you wanting to remove them or replace them? In your example, do you want `"Tiesto & Sevenn"` or `"Tisto & Sevenn"` or something else? – Zach Gates Sep 01 '17 at 20:21
  • Yes. I've tried unidecode(u'string'). I want the ë character to be changed to e, not removed altogether. – god Sep 01 '17 at 20:25
  • unidecode does that. – user2357112 Sep 01 '17 at 20:29

1 Answer


Simply call unidecode on the string variable itself (unquoted), not on the literal u'string':

>>> from unidecode import unidecode
>>> unidecode(string)
'Tiesto & Sevenn - BOOM (Artelax Remix)'

There's also the longer/slower route of removing combining characters after normalising into a decomposed form:

>>> import unicodedata
>>> ''.join(s for s in unicodedata.normalize('NFD', string) if not unicodedata.combining(s))
'Tiesto & Sevenn - BOOM (Artelax Remix)'
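
Both calls above expect a unicode object rather than an encoded byte string, so the file needs to be decoded when it is read. Here is a minimal sketch of that, assuming data.csv is UTF-8 encoded (the encoding is a guess, so adjust it to match your file). The helper names `read_titles` and `strip_accents` are made up for illustration, and the accent stripping uses only the standard library so the example runs without extra packages:

```python
import io
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (stdlib-only approach)."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def read_titles(path, encoding='utf-8'):
    """Decode each line while reading, so callers only ever see unicode text.

    The default encoding is an assumption; pass the real one for your file.
    """
    with io.open(path, encoding=encoding) as handle:
        return [line.rstrip() for line in handle]
```

With the lines decoded this way, `unidecode(title)` should work in place of `strip_accents(title)` without the "not an unicode object" warning. unidecode also transliterates characters that have no decomposed form (for example ß or ø), which the stdlib approach leaves untouched.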
Moses Koledoye
  • unidecode(string) produces a warning because some of the strings in my data.csv file are good to go and don't need to be converted through unidecode: /usr/lib64/python2.7/site-packages/unidecode/__init__.py:46: RuntimeWarning: Argument is not an unicode object. Passing an encoded string will likely have unexpected results. Would it make more sense to sanitize data.csv up front, converting all non-ASCII characters in the file, rather than converting each string when I pull it out? – god Sep 01 '17 at 20:45
  • @god: You need to actually read the data *as unicode* before you sanitize it. Use [`codecs.open`](https://docs.python.org/3/library/codecs.html#codecs.open), and specify the correct encoding. – user2357112 Sep 01 '17 at 20:52