Regex for a (twitter-like) hashtag that allows non-ASCII characters

Question

I want a regex that matches a simple hashtag like those on Twitter (e.g. #someword). It should also recognize non-ASCII characters (like those in Spanish, Hebrew or Chinese).

This was my initial regex: (^|\s|\b)(#(\w+))\b
but it doesn't recognize non-ASCII characters.
Then I tried XRegExp.js, which worked but ran too slowly.

Any suggestions for how to do it?

A word boundary can't simply be used with Unicode. See http://www.unicode.org/reports/tr18/#Default_Word_Boundaries – Toto, Jun 05 '13 at 14:36
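To make the comment above concrete, here is a minimal sketch of why the question's pattern falls short, next to one possible alternative. It assumes a runtime with Unicode property escapes (the `u` flag with `\p{...}`, added in ES2018), which was not available when this was asked. The names `asciiOnly`, `unicodeAware` and `tags` are made up for illustration.

```javascript
// The question's original pattern: without the `u` flag, \w and \b
// only treat [A-Za-z0-9_] as word characters.
const asciiOnly = /(^|\s|\b)(#(\w+))\b/g;

// Assumed alternative (ES2018+): Unicode property escapes.
// \p{L} = letters, \p{M} = combining marks (e.g. Hebrew vowel points),
// \p{N} = digits. The tag must start the string or follow whitespace.
const unicodeAware = /(?:^|\s)#([\p{L}\p{M}\p{N}_]+)/gu;

// Hypothetical helper: return the last capture group of every match,
// which in both patterns is the tag text without the leading '#'.
function tags(pattern, text) {
  return Array.from(text.matchAll(pattern), m => m[m.length - 1]);
}

const sample = "#hasta #mañana #שלום #你好";
console.log(tags(asciiOnly, sample));    // non-ASCII tags are missed or cut short
console.log(tags(unicodeAware, sample)); // all four tags are found
```

Whether underscores and digits belong in a tag is a policy choice; adjust the character class to taste.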

score 7 · Accepted Answer · edited Jun 06 '16 at 01:10

Eventually I found twitter-text.js, which is basically how Twitter solves this problem.

answered Jun 16 '13 at 12:46 by limlim · edited Jun 06 '16 at 01:10 by Mohammad Kermani

The excellent repo moved here: https://github.com/twitter/twitter-text/tree/master/js, where it was aggregated with a list for all languages: https://github.com/twitter/twitter-text – Rakan Nimer Jun 03 '15 at 09:05

score 3 · Answer 2 · answered Jun 05 '13 at 14:36

With native JS regexes that don't support Unicode, your only option is to explicitly enumerate the characters that can end the tag and match everything else, for example:

> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:;]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]

The [\s.,:;] class should include whitespace, punctuation and whatever else can be considered a terminating symbol.
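The single `match` call above returns only the first tag. Here is a rough sketch of the same lazy-match-plus-lookahead idea collecting every tag in a string. The function name `extractHashtags` and the default terminator list are assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper built on the lookahead approach from this answer.
// `terminators` is spliced into a character class, so it uses regex
// character-class syntax (e.g. "\\s.,:;" means whitespace . , : ;).
function extractHashtags(text, terminators = "\\s.,:;") {
  // '#', then as few characters as possible, up to (not including)
  // a terminator or the end of the input.
  const pattern = new RegExp("#(.+?)(?=[" + terminators + "]|$)", "g");
  const found = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]); // group 1 is the tag without the '#'
  }
  return found;
}

console.log(extractHashtags("foo #הַתִּקְוָה. bar, #baz"));
// Extra terminators can be passed when the defaults are too narrow:
console.log(extractHashtags("wow #yes!", "\\s.,:;!"));
```

Any character missing from the terminator list becomes part of the tag, so the list needs to cover whatever punctuation shows up in your input.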

score 1 · Answer 3 · answered Jun 05 '13 at 14:23 · edited Jun 05 '13 at 23:36

#([^#]+)[\s,;]*

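Explanation: This regular expression matches a # followed by one or more non-# characters, followed by zero or more whitespace characters, commas or semicolons. Because [^#]+ is greedy and also matches whitespace, each match runs up to the next # and includes any trailing whitespace.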

var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);

Result:

["#hasta ", "#mañana ", "#babהַ"]

EDIT: Replaced the \b word boundary with [\s,;]*
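Since `[^#]+` also matches spaces, text after a tag gets pulled into the match. Here is a small sketch of that behaviour next to a tightened variant that also excludes whitespace, commas and semicolons from the tag body. The variant and the function names `looseMatches` and `strictTags` are illustrative assumptions, not part of the answer as posted.

```javascript
// The pattern from this answer: everything up to the next '#'.
function looseMatches(text) {
  return text.match(/#([^#]+)[\s,;]*/g) || [];
}

// Assumed tightened variant: the tag body stops at the first
// whitespace character, comma, semicolon or '#'.
function strictTags(text) {
  return Array.from(text.matchAll(/#([^#\s,;]+)/g), m => m[1]);
}

const input = "#hasta #mañana baby, #babהַ";
console.log(looseMatches(input)); // "#mañana baby, " comes back as a single match
console.log(strictTags(input));   // tag bodies only, without '#' or trailing text
```

The strict variant still accepts any other punctuation inside a tag, so add more characters to the negated class if needed.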

answered Jun 05 '13 at 14:23 by itsmejodie · edited Jun 05 '13 at 23:36

The `?` after the `+` just means "don't be too greedy" when trying to match all the non-hash characters. – itsmejodie Jun 05 '13 at 14:25
With the '?' it doesn't match '#mañana', and without it, it recognizes '#mañana baby' as one hashtag. Not to mention Hebrew, which it doesn't recognize at all. – limlim Jun 05 '13 at 14:31
Word boundary `\b` is a zero-length assertion that is true between a word character and a non-word character – Toto Jun 05 '13 at 14:31
As pointed out, \b isn't correct when dealing with non-Latin characters. I've revised my answer. Typically hashtags do not contain spaces, @limlim – itsmejodie Jun 05 '13 at 23:39
