How to force sklearn CountVectorizer to not remove special characters (e.g. #, @, $ or %)

Question

Here is my code:

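from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(lowercase=False)

vocabulary = count.fit_transform([words])
print(count.get_feature_names())  # get_feature_names_out() in scikit-learn 1.0 and later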

For example if:

 words = "Hello @friend, this is a good day. #good."

I want it to be separated into this:

['Hello', '@friend', 'this', 'is', 'a', 'good', 'day', '#good']

Currently, this is what it is separated into:

['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']

Ankur Sinha · Accepted Answer · 2019-08-09T06:24:36.880

4

You can use the token_pattern parameter of CountVectorizer, as described in the documentation.

Pass a regex that tells CountVectorizer what counts as a token. In this case, we want words that contain # or @ to be kept whole, so those characters go into the character class:

count = CountVectorizer(lowercase=False, token_pattern=r'[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+')

Output:

['#good', '@friend', 'Hello', 'a', 'day', 'good', 'is', 'this']

edited Aug 09 '19 at 06:24

answered Aug 09 '19 at 06:10

Ankur Sinha


Also, how would I force CountVectorizer to ignore certain words? If words was: `words = "Hello @friend, this is a good day https://www.google.com/. #good."` I want it to still be separated into: `['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']`, without the URL. Thanks a lot! – Felix H. Aug 09 '19 at 06:45
You can pass another parameter called stop_words and assign the list of words to be ignored. Please check the documentation link I posted, it is written there :) – Ankur Sinha Aug 09 '19 at 06:49
Here is a link: https://stackoverflow.com/questions/40124476/how-to-set-custom-stop-words-for-sklearn-countvectorizer – Ankur Sinha Aug 09 '19 at 06:50
Is there any way I could set the parameter so that it removes any string containing `https`, in order to drop any URL that appears? Thanks a lot again. – Felix H. Aug 09 '19 at 07:13
You can clean your string first by removing all https links with `words = re.sub(r"\bhttps:\//[a-z0-9.]*", '', words)` and then go ahead with the usual solution. :) – Ankur Sinha Aug 09 '19 at 09:34
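
The URL clean-up suggested in the last comment could be sketched roughly as follows, using only the standard library. The pattern `https?://\S+` is an assumption chosen for this illustration: it is broader than the one in the comment, since it also matches `http` links and consumes everything up to the next whitespace.

```python
import re

words = "Hello @friend, this is a good day https://www.google.com/. #good."

# Assumed pattern: "http://" or "https://" followed by any run of
# non-whitespace characters. Punctuation glued to the end of the URL
# (the "/." here) is removed along with it.
url_pattern = re.compile(r"https?://\S+")

cleaned = url_pattern.sub("", words)
# Collapse the double space left behind where the URL used to be.
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
```

The cleaned string can then be passed to CountVectorizer as before.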
