How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

Question

I want to extract #hashtags from a string, including those that contain special characters, such as #1+1.

Currently I'm using:

@hashtags ||= string.scan(/#\w+/)

But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.

How do I do this?

EDIT:
If the last character is a special character, it should be removed, as in `#hashtag,` `#hashtag.` `#hashtag!` `#hashtag?` and so on.

Also, the hash sign at the beginning should be removed.

@Subs: It is equivalent to `@hashtags = @hashtags || ...`. So `@hashtags` keeps its value if it's not `nil`/`false`/undefined and is set to the `scan` result otherwise. — undur_gongor, Jun 05 '12 at 14:24
You need to decide what you want before you ask a question and keep changing it while some people waste their time trying to help — Adriano Bacha, Jun 05 '12 at 15:11
@undur_gongor In fact it's `@hashtags || @hashtags = ...` (for most cases) — Holger Just, Jun 05 '12 at 19:39
@HolgerJust: That's true. Too bad I cannot correct the comment. — undur_gongor, Jun 06 '12 at 06:52
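
The difference between the two expansions discussed in these comments can be observed with a hash that has a default value. This is only a minimal sketch; the method names `expanded_form` and `shorthand_form` are illustrative and do not come from the question.

```ruby
# A hash created with Hash.new(0) returns 0 for missing keys without storing them.
# In Ruby, 0 is truthy, so `h[:tags] || 5` evaluates to 0.

def expanded_form
  h = Hash.new(0)
  h[:tags] = h[:tags] || 5   # always performs the assignment
  h
end

def shorthand_form
  h = Hash.new(0)
  h[:tags] ||= 5             # behaves like h[:tags] || h[:tags] = 5
  h
end

p expanded_form.key?(:tags)   # => true  (the key was stored with value 0)
p shorthand_form.key?(:tags)  # => false (no assignment took place)
```

For a plain instance variable such as `@hashtags`, both readings give the same result, which is why the simpler explanation is usually good enough.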

Todd A. Jacobs · Answer 1 · 2012-06-05T14:31:57.647

1

The Solution

You probably want something like:

'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]

Note that the zero-width lookbehind assertion `(?<=#)` at the beginning is required to avoid capturing the pound sign as part of the match.
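
To see the effect of the lookbehind in isolation, here is a small sketch that uses a deliberately simplified pattern (`\w+` rather than the full expression above). The method names `tags_with_hash` and `tags_without_hash` are made up for this illustration.

```ruby
# Plain pattern: the pound sign is part of every match.
def tags_with_hash(text)
  text.scan(/#\w+/)
end

# Lookbehind pattern: (?<=#) requires a preceding "#" but does not consume it,
# so the pound sign is left out of the match.
def tags_without_hash(text)
  text.scan(/(?<=#)\w+/)
end

sample = "Learning #ruby and #rails"
p tags_with_hash(sample)     # => ["#ruby", "#rails"]
p tags_without_hash(sample)  # => ["ruby", "rails"]
```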


edited Jun 05 '12 at 14:31

answered Jun 05 '12 at 14:03


the beginning hash sign should be removed. so it – Gal Ben-Haim Jun 05 '12 at 14:08
@GalBen-Haim Your original question doesn't say anything about stripping the pound sign. I've updated my answer based on your comment above, but please update the question itself to reflect what you're really asking. In future, you can improve your questions by posting a sample of what you want the output to look like. – Todd A. Jacobs Jun 05 '12 at 14:22
if the last character is a plus sign it captures it "#test+".scan /\b(?<=#)[^#[:punct:]]+/ => ["test+"] – Gal Ben-Haim Jun 05 '12 at 14:23
it also creates duplicates in the result array – Gal Ben-Haim Jun 05 '12 at 14:31
@GalBen-Haim This is called "scope creep." You asked for a regular expression that matches, but you keep changing your requirements. Please feel free to adapt the updated regular expression for your needs, or to build additional code around the results of your expression match. Good luck! – Todd A. Jacobs Jun 05 '12 at 14:35

Subs · Answer 2 · 2012-06-05T16:06:24.340

0

How about this:

@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]

This takes care of `#alphabets` as well as `#2323+2323`, `#2323-2323`, and `#2323+65656-67676`.

It also removes the # at the beginning.

Or if you want it in array form:

@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect { |x| x[1..-1] }

This took me a long time, and I still don't understand why `scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)` works on my computer but `scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)` does not. The only difference is the `()` in the second `scan` call. The parentheses make no difference when I use the `match` method, which is what I would expect.
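
A likely explanation for the difference: when a pattern contains capture groups, `String#scan` returns the groups of each match instead of the whole match, so a match that goes through the ungrouped alternative yields `nil`. The sketch below illustrates this with the two patterns from this answer; the method names and the sample string are made up for the illustration.

```ruby
# No capture group: scan returns each whole match as a string.
def scan_without_group(str)
  str.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)
end

# One capture group: scan returns an array of groups per match.
# Matches made by the second alternative leave group 1 unset (nil).
def scan_with_group(str)
  str.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

# A non-capturing group (?:...) groups without changing what scan returns.
def scan_with_non_capturing_group(str)
  str.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

sample = "#abc #12+34"
p scan_without_group(sample)             # => ["#abc", "#12+34"]
p scan_with_group(sample)                # => [["#abc"], [nil]]
p scan_with_non_capturing_group(sample)  # => ["#abc", "#12+34"]

# match(...).to_s always returns the whole first match, groups or not.
p sample.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s  # => "#abc"
```

This would also account for the `nil` values reported in the comment below.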

edited Jun 05 '12 at 16:06

answered Jun 05 '12 at 13:54


it creates extra nil values "#test".scan(/(#\w+)|(#[\d\+-]+\d+)/) => [["#test", nil]] – Gal Ben-Haim Jun 05 '12 at 14:03
@GalBen-Haim edited - Note: I included "-" in the OR part for numbers. You can remove it if you don't need it. – Subs Jun 05 '12 at 15:02

score 0 · Accepted Answer · answered Jun 05 '12 at 14:52

0

This should work:

@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten

Or if you don't want your hashtag to start with a special character:

@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
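
As a rough check of how the two patterns behave, here is a sketch that wraps them in methods and runs them on a sample string. The method names `hashtags` and `strict_hashtags` and the sample text are illustrative assumptions, and the `\u00e9` escape is used so the string is UTF-8 regardless of the source file encoding.

```ruby
# First pattern: any run of visible characters after "#", ending in a letter or digit.
def hashtags(str)
  str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
end

# Second pattern: the tag must also start with a letter or digit.
def strict_hashtags(str)
  str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
end

sample = "Try #1+1, #hashtag! and #caf\u00e9."

# Trailing punctuation is dropped because the match must end in [[:alnum:]];
# the capture group leaves out the leading "#".
p hashtags(sample)         # => ["1+1", "hashtag", "café"]
p strict_hashtags(sample)  # => ["1+1", "hashtag", "café"]

# The two patterns differ when the tag starts with a special character.
p hashtags("#+x")          # => ["+x"]
p strict_hashtags("#+x")   # => []
```

The POSIX bracket expressions are Unicode-aware for UTF-8 strings in Ruby, which is why the accented character is kept.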

answered Jun 05 '12 at 14:52

Stefan


I really liked your example of using capture groups as an alternative to matching zero-width assertions. You should probably add UTF-8 encoding, though, as that's part of the OPs question. – Todd A. Jacobs Jun 05 '12 at 15:18
