-1

I want to extract #hashtags from a string, including those that contain special characters, such as #1+1.

Currently I'm using:

@hashtags ||= string.scan(/#\w+/)

But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.

How do I do this?

EDIT:
If a hashtag ends with a special character, that trailing character should be removed, as in #hashtag, #hashtag. #hashtag! #hashtag? and so on.

Also, the hash sign at the beginning should be removed.

Casper
Gal Ben-Haim

3 Answers

1

The Solution

You probably want something like:

'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]

Note that the zero-width lookbehind assertion (?<=#) at the beginning is required to keep the pound sign out of the match.
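
As a rough illustration of what the lookbehind changes, the sketch below contrasts three ways of scanning. It deliberately uses the simpler \w+ pattern from the question rather than the pattern above, and the sample string and variable names are made up for the example.

```ruby
# Hypothetical sample input, not taken from the question.
text = "#ruby and #rails"

# Plain match: the pound sign is part of every result.
with_hash = text.scan(/#\w+/)

# Zero-width lookbehind: "#" must precede the match but is not consumed.
with_lookbehind = text.scan(/(?<=#)\w+/)

# Capture group: scan returns one array per match, so flatten the result.
with_capture = text.scan(/#(\w+)/).flatten

p with_hash        # ["#ruby", "#rails"]
p with_lookbehind  # ["ruby", "rails"]
p with_capture     # ["ruby", "rails"]
```

The lookbehind and the capture group give the same tags here; which one to use is mostly a matter of taste.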

Todd A. Jacobs
  • the beginning hash sign should be removed. so it – Gal Ben-Haim Jun 05 '12 at 14:08
  • @GalBen-Haim Your original question doesn't say anything about stripping the pound sign. I've updated my answer based on your comment above, but please update the question itself to reflect what you're really asking. In future, you can improve your questions by posting a sample of what you want the output to look like. – Todd A. Jacobs Jun 05 '12 at 14:22
  • If the last character is a plus sign, it is captured as well: "#test+".scan /\b(?<=#)[^#[:punct:]]+/ => ["test+"] – Gal Ben-Haim Jun 05 '12 at 14:23
  • It also creates duplicates in the result array. – Gal Ben-Haim Jun 05 '12 at 14:31
  • 1
    @GalBen-Haim This is called "scope creep." You asked for a regular expression that matches, but you keep changing your requirements. Please feel free to adapt the updated regular expression for your needs, or to build additional code around the results of your expression match. Good luck! – Todd A. Jacobs Jun 05 '12 at 14:35
0

How about this:

@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]

This takes care of alphabetic tags such as #alphabets and numeric ones such as #2323+2323, #2323-2323 or #2323+65656-67676.

It also removes the # at the beginning.

Or, if you want every match as an array (match returns only the first one):

@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect { |x| x[1..-1] }

This took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works on my computer but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not. The only difference is the () in the second scan call. With the match method the parentheses make no difference, as expected.
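
A likely explanation, sketched below: String#scan returns the whole match only when the pattern has no capture groups. Once a group is present, it returns an array of the groups for each match, and a group that did not take part in a match shows up as nil. The sample string and variable names are hypothetical.

```ruby
# Hypothetical sample input with one alphabetic and one numeric tag.
text = "#ruby #12+34"

# No capture group: scan returns the full text of each match.
without_group = text.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)

# One capture group: scan returns the groups, not the full match.
# The second match comes from the ungrouped alternative, so its group is nil.
with_group = text.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)

# A non-capturing group (?:...) groups without changing what scan returns.
non_capturing = text.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)

p without_group  # ["#ruby", "#12+34"]
p with_group     # [["#ruby"], [nil]]
p non_capturing  # ["#ruby", "#12+34"]
```

match is unaffected because to_s on a MatchData always gives the entire matched text, regardless of groups.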

Subs
0

This should work:

@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten

Or if you don't want your hashtag to start with a special character:

@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
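
To check these two patterns against the requirements in the question (trailing punctuation dropped, # removed, non-ASCII letters kept), here is a small sketch. The sample strings and the extract_hashtags helper are hypothetical, and uniq is added only to avoid the duplicates mentioned in the comments on another answer.

```ruby
# Hypothetical helper wrapping the first pattern above; uniq drops repeats.
def extract_hashtags(str)
  str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten.uniq
end

# "\u00e9" is an accented e, written as an escape so the example
# does not depend on the encoding of the source file.
text = "#hashtag, #hashtag. #hashtag! #1+1 #caf\u00e9?"
tags = extract_hashtags(text)
# tags == ["hashtag", "1+1", "caf\u00e9"]

# The two patterns differ when a tag starts with a special character.
sample = "#+1 #1+1"
loose  = sample.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
strict = sample.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
# loose  == ["+1", "1+1"]
# strict == ["1+1"]
```

Because [[:graph:]] also matches #, adjacent tags without whitespace between them (such as #a#b) would be returned as a single tag by either pattern.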
Stefan
  • I really liked your example of using capture groups as an alternative to matching zero-width assertions. You should probably add UTF-8 encoding, though, as that's part of the OP's question. – Todd A. Jacobs Jun 05 '12 at 15:18