
I have a web scraper that takes forum questions, splits them into individual words, and writes them to a text file. The words are stored in a list of tuples, each containing a word and its frequency, like so:

[(u'move', 3), (u'exploration', 4), (u'prediction', 21),
 (u'find', 5), (u'user', 2), (u'interface', 2), (u'pleasant', 2),
 (u'am', 11), (u'puzzled', 2), (u'find', 5), (u'way', 5),
 (u'prediction', 21), (u'mode', 2), (u'have', 21),
 (u'explored', 2), (u'file', 9), (u'Can', 7), (u'help', 6),
 (u'Possible', 1), (u'bug', 2), (u'data', 31), (u'is', 17)]

However, someone on the forum used the character \u200b, which breaks all my code because that character is no longer classified as Unicode whitespace.

(u'used\u200b', 1)

Printing it out does not produce an error, but writing it to a text file does. I have found that string.strip() and string.replace() do not help, so I was wondering how to use the regex library to get rid of that character. I plan to go through the entire list of tuples to find it.

Mike Samuel
ceilingfan999
  • Why do you say it's not Unicode whitespace character? That's quite literally what it is. [U+200B](https://www.fileformat.info/info/unicode/char/200B/index.htm) – tripleee Dec 18 '20 at 05:50

1 Answer


I tested that with Python 2.7. replace works as expected:

>>> u'used\u200b'.replace(u'\u200b', '*')
u'used*'

and so does strip:

>>> u'used\u200b'.strip(u'\u200b')
u'used'

Just remember that the arguments to those functions have to be Unicode literals: u'\u200b', not '\u200b'. Note the u prefix.
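
A likely reason a bare strip() appeared not to help is that Python does not treat U+200B as whitespace, so it has to be named explicitly. The following is a minimal sketch of that check, not taken from the original post; the variable names are illustrative only.

```python
import unicodedata

zwsp = u'\u200b'          # ZERO WIDTH SPACE
word = u'used' + zwsp     # the problematic token from the question

# Python's own view of the character: not whitespace, category "Cf" (format).
is_space = zwsp.isspace()
category = unicodedata.category(zwsp)

# strip() with no argument only removes whitespace, so the character survives.
bare = word.strip()

# Passing the character explicitly removes it.
explicit = word.strip(zwsp)

print(is_space, category, repr(bare), repr(explicit))
```

On Python 3 every string literal is already Unicode, so the u prefix is optional there; it is kept here so the sketch also runs on 2.7.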

And actually, writing that character to a file works just fine.

>>> import codecs
>>> f = codecs.open('a.txt', encoding='utf-8', mode='w')
>>> f.write(u'used\u200bZero')
>>> f.close()
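
If the write error in the question came from opening the file without an encoding (an assumption, since the failing code is not shown), the same idea can be sketched with io.open and a context manager, which runs on both Python 2.7 and 3. The temporary path below is hypothetical and only there to keep the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')

# An explicit encoding lets the file object encode U+200B as UTF-8 bytes.
with io.open(path, mode='w', encoding='utf-8') as f:
    f.write(u'used\u200bZero')

# Read the text back to confirm the character round-trips intact.
with io.open(path, mode='r', encoding='utf-8') as f:
    text = f.read()

print(repr(text))
```

The with block also closes the file, so the data is flushed to disk before it is read back.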

See resources:

Aminah Nuraini
roeland
  • `split()` and `replace` are not proper approaches, since you won't always encounter `\u200`. – Mazdak Jul 21 '15 at 07:25
  • @Kasramvd you can give more than one character as argument to `strip`. And there are plenty of ways to replace more than one character as well (eg. using regex). – roeland Jul 21 '15 at 22:12
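
As a rough sketch of the regex route mentioned in the comments, applied to the list of tuples from the question: the character class below is an assumed selection of invisible characters rather than a complete list, and clean_pairs is a hypothetical helper name.

```python
import re

# Assumed set of invisible characters to remove; extend it to suit the data.
INVISIBLE = re.compile(u'[\u200b\u200c\u200d\u2060\ufeff]')


def clean_pairs(pairs):
    """Return new (word, count) tuples with the invisible characters removed."""
    return [(INVISIBLE.sub(u'', word), count) for word, count in pairs]


words = [(u'used\u200b', 1), (u'find', 5), (u'\ufeffdata', 31)]
cleaned = clean_pairs(words)
print(cleaned)
```

This only rewrites the words. If cleaning makes two entries identical, their counts would still need to be merged separately.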