I would like to tokenize a tweet. As you probably know, tweets are usually written informally, as follows:

This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) might #appear❤ ❤☺❤#ThisIsAHashtag!?!

You may also have emoji as Unicode characters (hearts, smiles, etc.). I'm working on a preg_split call to tokenize the tweet. The desired output is:

This
is
a
common
Tweet
#format
where
@mentions
and
.
errors
!!!!
like
this
:-)))))
might
#appear
❤
❤
☺
❤
#ThisIsAHashtag
!?!

The current preg_split I've implemented so far is:

preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?(){}-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $tweet);

Any help is appreciated.

Mauro
  • how does/should this differ from `explode(' ',$tweet)` ? and why is the 5th token `tweet` but you want `Tweet` ? – birdspider Aug 08 '14 at 15:15
  • @birdspider exploding by space doesn't work as there are tokens in his example tweet that are not separated by spaces. – Populus Aug 08 '14 at 15:19
  • Sorry, Tweet was meant to be uppercase in the original tweet. Exactly, splitting on ' ' (spaces) can't be done... – Mauro Aug 08 '14 at 15:21
  • Why is ❤ ❤❤ split into three lines? – ST3 Aug 08 '14 at 15:22
  • @ST3 Because you have "#appear❤ ❤❤#", so the space is not tokenized. I want each emoji on its own. The second series ❤❤# might be ☺☺❤❤, so each emoji should be tokenized. – Mauro Aug 08 '14 at 15:27

1 Answer

You can use this pattern with preg_match_all:

~[#@]?\w+|\pP+|\S~u

online demo
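
To make the usage concrete, here is a minimal sketch of applying the pattern above to the tweet from the question. The helper name `tokenizeTweet` is only illustrative, and the `"\u{...}"` escapes assume PHP 7 or later; they stand in for the heart and smiley characters so the snippet does not depend on file encoding.

```php
<?php
// Hypothetical helper: returns every match of the pattern, in order.
function tokenizeTweet(string $tweet): array
{
    // [#@]?\w+  a word, optionally prefixed by # or @
    // \pP+      a run of punctuation characters
    // \S        any other single non-space character (e.g. an emoji)
    preg_match_all('~[#@]?\w+|\pP+|\S~u', $tweet, $matches);
    return $matches[0];
}

// The example tweet; U+2764 is the heart, U+263A the smiley.
$tweet = 'This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) might #appear'
    . "\u{2764} \u{2764}\u{263A}\u{2764}"
    . '#ThisIsAHashtag!?!';

$tokens = tokenizeTweet($tweet);
echo implode(PHP_EOL, $tokens), PHP_EOL;
```

Because preg_match_all collects what matches instead of splitting on boundaries, whitespace is simply never captured, so there are no empty tokens to filter out afterwards.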

Note: You can easily extend this pattern if you need to group another kind of characters. Example with currency:

~[#@]?\w+|\pP+|\p{Sc}+|\S~u
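
As a rough illustration of what the extra alternative changes, the sketch below runs both patterns on a made-up input (the sample string and variable names are assumptions, not from the question). Without `\p{Sc}+`, each currency symbol falls through to `\S` and becomes its own token; with it, consecutive symbols are grouped.

```php
<?php
// Made-up sample input; U+20AC is the euro sign (requires PHP 7+ for "\u{...}").
$text = 'Price: $$$ or ' . "\u{20AC}\u{20AC}" . ' #deal';

// Base pattern: currency symbols are caught one at a time by \S.
preg_match_all('~[#@]?\w+|\pP+|\S~u', $text, $base);

// Extended pattern: \p{Sc}+ groups a run of currency symbols.
preg_match_all('~[#@]?\w+|\pP+|\p{Sc}+|\S~u', $text, $extended);

$baseTokens = $base[0];
$extendedTokens = $extended[0];

echo implode(' | ', $baseTokens), PHP_EOL;
echo implode(' | ', $extendedTokens), PHP_EOL;
```

Note that the new alternative has to come before `\S`, otherwise `\S` would match first and the grouping would never apply.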
Casimir et Hippolyte
  • This is Amazing! Moreover, I was not aware about that regex101 web site. Thank you.. you made my day. – Mauro Aug 08 '14 at 15:41
  • I have a doubt. How can I keep tokens like I'm and it's together? I mean those with an apostrophe between two words. – Mauro Aug 11 '14 at 09:27
  • ~[#@]?\w+[']?\w*|\pP+|\p{Sc}+|\S~u => I've done it like this. Thanks :) – Mauro Aug 11 '14 at 09:33
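
For reference, a small sketch of the apostrophe variant from the last comment, run on a made-up sentence (the sample text and variable names are assumptions). The apostrophe inside the character class has to be escaped here only because the pattern sits in a single-quoted PHP string.

```php
<?php
// Pattern from the comment above; \' is just PHP string escaping for '.
$pattern = '~[#@]?\w+[\']?\w*|\pP+|\p{Sc}+|\S~u';

// Made-up sample input with contractions.
$text = "I'm sure it's fine, isn't it?";

preg_match_all($pattern, $text, $matches);
$tokens = $matches[0];

echo implode(' | ', $tokens), PHP_EOL;
```

One caveat with this variant: since `\w*` may match nothing, an apostrophe directly after a word is absorbed even when no letters follow, so `dogs' toys` yields the token `dogs'`. It also only covers the ASCII apostrophe, not the typographic `’`.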