2

I have a few words in a list that are of the form '\uword'. I want to replace the '\u' with an empty string. I looked around on SO, but nothing has worked for me so far. I tried converting to a raw string using "%r"%word, but that didn't work. I also tried word.encode('unicode-escape'), but haven't gotten anywhere. Any ideas?

EDIT

Adding code

word = '\u2019'
word.encode('unicode-escape')
print(word) # error

word = '\u2019'
word = "%r"%word
print(word) # error
Clock Slave
  • 4
    Please include some code, showing what you have already tried. – turnip Feb 20 '17 at 13:07
  • 1
    `'\uword'.replace(r'\u', '')` -> `'word'` – martineau Feb 20 '17 at 13:12
  • 1
    replace `\\u` with '' – MohaMad Feb 20 '17 at 13:14
  • @Petar added code – Clock Slave Feb 20 '17 at 13:28
  • @ClockSlave when you say you get an error, what do you mean? Running the given code does not produce errors. – turnip Feb 20 '17 at 13:41
  • @Petar I am getting `UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 0: character maps to <undefined>` for both the cases. I'm using python 3 – Clock Slave Feb 20 '17 at 13:50
  • @Petar are you entering `word` on the console or `print(word)`? entering `word` doesn't give an error but printing does – Clock Slave Feb 20 '17 at 13:51
  • @martineau running that gives `SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape` – Clock Slave Feb 20 '17 at 13:52
  • Are you using Python 2 or Python 3? – jwodder Feb 20 '17 at 13:55
  • @jwodder using python 3 – Clock Slave Feb 20 '17 at 13:55
  • @ClockSlave I am having to test it here: https://repl.it/languages/python3 at the moment; I am not getting any errors. – turnip Feb 20 '17 at 13:58
  • @Petar using the `.encode` method, the `print(word)` line outputs `’` in the console you linked. I am still getting the error when I run the code on my command prompt. I guess I know what I was doing wrong. I was assuming '.encode' is an inplace method like `.sort()` method of lists. I thought that it would change the string. I read the docs, it says `.encode()` >>returns a bytes representation of the Unicode string, encoded in the requested encoding – Clock Slave Feb 20 '17 at 14:13
  • 1
    @ClockSlave yes, you are 100% correct. Good spot. You should submit it as the answer to your question. – turnip Feb 20 '17 at 14:16
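
The two errors reported in the comments can be reproduced without depending on a particular console. This is only a sketch: `cp437` is an assumed stand-in for the asker's console code page (the thread never states which one is in use), and the variable names are made up for illustration.

```python
word = '\u2019'  # a single character: RIGHT SINGLE QUOTATION MARK

# Assumption: cp437 stands in for a console code page that has no U+2019.
# print(word) fails on such a console for the same reason this encode() does.
try:
    word.encode('cp437')
    console_can_encode = True
except UnicodeEncodeError:
    console_can_encode = False

# '\uword' is not a valid literal in Python 3, because \u must be followed
# by exactly four hex digits. compile() lets us observe the SyntaxError.
try:
    compile(r"'\uword'", '<comment>', 'eval')
    literal_is_valid = True
except SyntaxError:
    literal_is_valid = False

# Escaping the non-ASCII character yields text that any console can print.
printable = word.encode('ascii', 'backslashreplace').decode('ascii')
print(console_can_encode, literal_is_valid, printable)
```

This also fits the observation that entering `word` at the prompt works while `print(word)` fails: the prompt shows an escaped representation when the console cannot encode the character, whereas `print` has to encode the character itself.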

4 Answers

3

I was wrong to assume that the .encode method of strings modifies the string in place, the way the .sort() method of a list does. According to the documentation:

The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

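def remove_u(word):
    # str.encode() returns a new bytes object; it never changes word itself.
    # The 'unicode-escape' output is pure ASCII, so it decodes back safely.
    word_u = word.encode('unicode-escape').decode('ascii')
    if r'\u' in word_u:
        # Drop every literal '\u' marker and keep the rest of the text.
        return word_u.replace('\\u', '')
    return word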

vocabulary_ = [remove_u(each_word) for each_word in vocabulary_]
Clock Slave
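
The difference between an in-place method and one that returns a new object, which is the point of the answer above, can be seen side by side. A minimal sketch with illustrative variable names:

```python
numbers = [3, 1, 2]
sort_result = numbers.sort()   # list.sort() mutates the list and returns None

word = '\u2019'
encoded = word.encode('unicode-escape')  # str.encode() returns a new bytes object

print(numbers, sort_result)    # the list itself changed
print(len(word), encoded)      # word is still one character; encoded holds the escape
```

Because Python strings are immutable, no string method can change `word`; the result has to be assigned to a name to be used.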
2

Given that you are dealing with strings only, you can simply convert the value to a plain string with the built-in `str()` function (the `u'...'` output shown here is Python 2 behaviour):

>>> string = u"your string"
>>> string
u'your string'
>>> str(string)
'your string'

Guess this will do!

vijay athithya
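
Since the asker reports using Python 3, it may help to check what the same steps do there. A small sketch, assuming Python 3.3 or later (where the `u` prefix is accepted again):

```python
text = u"your string"        # the u prefix is allowed but has no effect in Python 3

is_plain_str = type(text) is str
unchanged = str(text) == text
shown = repr(text)           # no u prefix appears in the representation

print(is_plain_str, unchanged, shown)
```

In other words, on Python 3 there is no `u` prefix to strip from the value, and `str()` leaves the text as it is.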
1

If I have understood correctly, you don't need regular expressions. In Python 2 (note the `print` statement), just try:

>>> string = '\u2019'
>>> char = string.decode('unicode-escape')
>>> print format(ord(char), 'x')
2019
logi-kal
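
For Python 3, where `str` has no `.decode()` method, a rough equivalent of the snippet above might look like the following. The helper name `code_points` is hypothetical and chosen only for illustration.

```python
def code_points(text):
    """Return the hexadecimal code point of each character in text."""
    return [format(ord(ch), 'x') for ch in text]


print(code_points('\u2019'))    # one character, one code point
print(code_points('a\u2019'))   # ordinary characters are converted too
```

Note that this converts every character, not only the ones written with a `\u` escape, so it suits single-character items such as `'\u2019'` better than mixed words.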
-2

Because you are facing problems with encodings and Unicode, it would help to know which version of Python you are using. I'm not sure I have understood you correctly, but this should do the trick:

string = r'\uword'
string = string.replace(r'\u', '')  # replace() returns a new string: 'word'
Nerade
  • I don't have a raw string. I have a string literal of the form `'\u2019'`. The above method doesn't work when `string = '\u2019'` – Clock Slave Feb 20 '17 at 13:31
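
The objection in the comment above comes down to the difference between a raw literal and an ordinary one. A short sketch (variable names are illustrative) of why `replace` works on one and not the other, and how to convert between the two forms:

```python
raw = r'\u2019'      # six characters: backslash, 'u', '2', '0', '1', '9'
cooked = '\u2019'    # one character: the escape is resolved by the parser

raw_stripped = raw.replace(r'\u', '')        # the literal '\u' is present, so it is removed
cooked_stripped = cooked.replace(r'\u', '')  # nothing to remove, the string is unchanged

# Converting between the two forms with the 'unicode-escape' codec.
to_raw = cooked.encode('unicode-escape').decode('ascii')
to_cooked = raw.encode('ascii').decode('unicode-escape')

print(len(raw), len(cooked), raw_stripped, to_raw == raw, to_cooked == cooked)
```

Once the single character has been turned back into its six-character escaped form, the `replace` call from the answer applies as written.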