I would like to tokenize a tweet. As you probably know, tweets are usually written informally, as in the following example:
This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) might #appear❤ ❤☺❤#ThisIsAHashtag!?!
The text may also contain emoji as Unicode characters (hearts, smileys, etc.). I'm working on a preg_split call to tokenize it. The desired output is:
This
is
a
common
Tweet
#format
where
@mentions
and
.
errors
!!!!
like
this
:-)))))
might
#appear
❤
❤
☺
❤
#ThisIsAHashtag
!?!
The preg_split call I have so far is:
preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?(){}-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $tweet);
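An alternative to splitting on lookarounds is to match the tokens directly with preg_match_all, which may be easier to reason about. The following is only a minimal sketch under assumptions, not a tested solution: the helper name tokenize_tweet and the pattern itself are hypothetical, the emoticon rule only covers simple forms like :-))))), and an emoji made of several code points would be split into separate tokens.

```php
<?php
// Hypothetical helper (not part of any library): returns the tokens of a tweet.
function tokenize_tweet(string $tweet): array
{
    // Alternatives are tried left to right, so the more specific ones come first.
    // x = extended syntax (whitespace and comments ignored), u = UTF-8 mode.
    $pattern = '~
        [#@]\w+        # hashtags and mentions
      | \w+            # plain words
      | [:;]-?[()]+    # simple emoticons (assumed shape: eyes, optional nose, mouths)
      | [.,;:!?]+      # runs of punctuation
      | \S             # any other single non-space character, such as an emoji
    ~xu';

    preg_match_all($pattern, $tweet, $matches);

    // $matches[0] holds the full matches in order of appearance.
    return $matches[0];
}

$tweet = 'This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) might #appear❤ ❤☺❤#ThisIsAHashtag!?!';

// Print one token per line.
echo implode("\n", tokenize_tweet($tweet)), "\n";
```

Because whitespace is never matched, it is dropped without any extra filtering, which is the main difference from the split-based approach.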
Any help is appreciated.