
Here is my code:

count = CountVectorizer(lowercase = False)

vocabulary = count.fit_transform([words])
print(count.get_feature_names())

For example if:

 words = "Hello @friend, this is a good day. #good."

I want it to be separated into this:

['Hello', '@friend', 'this', 'is', 'a', 'good', 'day', '#good']

Currently, this is what it is separated into:

['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']
Felix H.

1 Answer


You can use the token_pattern parameter of CountVectorizer, as described in the documentation.

token_pattern takes a regex that tells CountVectorizer what should be considered a word. In this case we want CountVectorizer to treat strings containing # or @ as words too, so do:

count = CountVectorizer(lowercase = False, token_pattern = '[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+')

Output:

['#good', '@friend', 'Hello', 'a', 'day', 'good', 'is', 'this']
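
For reference, here is a minimal self-contained sketch of the same idea. It assumes a narrower pattern than the one above, `[@#]?\w+` (an optional leading @ or #, then word characters), which is my own illustrative choice and not the pattern from this answer. It also reads the tokens from the fitted `vocabulary_` attribute, since `get_feature_names()` was removed in scikit-learn 1.2 in favor of `get_feature_names_out()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = "Hello @friend, this is a good day. #good."

# Illustrative pattern (an assumption, not the answer's regex):
# an optional leading '@' or '#', followed by one or more word characters.
count = CountVectorizer(lowercase=False, token_pattern=r"[@#]?\w+")

# fit_transform expects an iterable of documents, hence the list.
counts = count.fit_transform([words])

# vocabulary_ maps each token to its column index; sorting the keys
# gives the tokens in the same order as the feature names.
features = sorted(count.vocabulary_)
print(features)
```

Because punctuation such as `,` and `.` is not part of the pattern, it is dropped from the tokens, while `#good` and `good` stay separate entries.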
Ankur Sinha
  • Also, how would I force CountVectorizer to ignore certain words? If words were: `words = "Hello @friend, this is a good day https://www.google.com/. #good."` I want it to still be separated into: `['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']`, without the URL. Thanks a lot! – Felix H. Aug 09 '19 at 06:45
  • You can pass another parameter called stop_words and assign the list of words to be ignored. Please check the documentation link I posted, it is written there :) – Ankur Sinha Aug 09 '19 at 06:49
  • Here is a link: https://stackoverflow.com/questions/40124476/how-to-set-custom-stop-words-for-sklearn-countvectorizer – Ankur Sinha Aug 09 '19 at 06:50
  • Is there any way I could set the parameter so that it removes any string containing `https`, in order to drop any URL that appears? Thanks a lot again. – Felix H. Aug 09 '19 at 07:13
  • You can clean your string first by removing all https links with `words = re.sub(r"\bhttps:\//[a-z0-9.]*", '', words)` and then go ahead with the usual solution. :) – Ankur Sinha Aug 09 '19 at 09:34
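
The two suggestions from the comments (a custom stop_words list and stripping URLs before vectorizing) can be combined roughly as follows. This is only a sketch under some assumptions: the helper name `strip_urls`, the broader URL regex `https?://\S+`, the token pattern `\w+`, and the stop words `this` and `is` are all picked for illustration and are not from the thread.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

words = "Hello @friend, this is a good day https://www.google.com/. #good."


def strip_urls(text):
    """Remove http:// and https:// URLs, up to the next whitespace.

    Hypothetical helper: \\S+ also swallows punctuation glued to the
    end of the URL, which is acceptable here but may not be elsewhere.
    """
    return re.sub(r"https?://\S+", "", text)


cleaned = strip_urls(words)

# stop_words accepts a list of tokens to drop; 'this' and 'is' are
# arbitrary examples. lowercase=False keeps 'Hello' capitalized.
count = CountVectorizer(
    lowercase=False,
    token_pattern=r"\w+",
    stop_words=["this", "is"],
)
count.fit_transform([cleaned])

# Sorted keys of vocabulary_ match the order of the feature names.
features = sorted(count.vocabulary_)
print(features)
```

Cleaning the string first keeps the token pattern simple, since the regex passed to token_pattern only describes what a token looks like and cannot by itself discard a whole URL.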