
My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep the terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing those characters with other alphanumeric strings. For example,

String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");

Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
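If the broken email addresses ever do become a problem, one refinement (a sketch of my own, not part of the original approach) is to rewrite # and @ only when they start a whitespace-delimited token, so the @ inside an email address is left alone:

```java
// Sketch: only rewrite # and @ at the start of a whitespace-delimited
// token; the @ embedded in email addresses is untouched.
String tweet = "mail foo@bar.com or ping @alice about #lucene";
String newText = tweet
        .replaceAll("(^|\\s)#", "$1hashtag")
        .replaceAll("(^|\\s)@", "$1addresstag");
// foo@bar.com survives unchanged; @alice and #lucene become
// addresstagalice and hashtaglucene.
```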

Thanks in advance!

Amaç

Ruggiero Spearman

6 Answers


The StandardAnalyzer basically passes the tokens from the StandardTokenizer through a StandardFilter (which removes all kinds of characters from the standard tokens, like the 's at the ends of words), followed by a LowerCaseFilter (to lowercase your words) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

What you could easily do to get started is implement your own analyzer that performs the same as the StandardAnalyzer but uses a WhitespaceTokenizer as the first item that processes the input stream.
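As an illustration only (plain Java, no Lucene; the class name, method name, and stop list are stand-ins of mine, not Lucene's real API or default stop set), the effect of such a chain — whitespace tokenizing, then lowercasing, then stop-word removal — looks like this:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of a WhitespaceTokenizer-based analyzer chain:
// split on whitespace, lowercase, drop stop words. Because no
// punctuation is stripped, @user and #hashtag tokens survive intact.
public class WhitespaceChainSketch {
    // Tiny stand-in stop list, not Lucene's actual default set.
    static final Set<String> STOP_WORDS = Set.of("as", "in", "for", "the", "a");

    static List<String> analyze(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .map(String::toLowerCase)
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }
}
```

For example, analyze("Search #Lucene for @Amac") yields [search, #lucene, @amac], whereas the StandardTokenizer would have split the @ and # off.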

For more details on the inner workings of the analyzers, you can have a look over here

Thomas
  • Thanks. I already tried implementing my own Analyzer using a WhitespaceTokenizer instead of the StandardTokenizer. But that leaves host names, email addresses, and some other stuff unrecognized and tokenized erroneously. I would like to process the stream with my custom TwitterTokenizer (which handles @s and #s and does nothing else), then feed the resulting stream into a StandardTokenizer and go on from there. However, as far as I understand, an Analyzer can have only one Tokenizer at the beginning of the chain. – Ruggiero Spearman Apr 01 '10 at 08:56
  • 1
    Another approach could be to use PerFieldAnalyzerWrapper and make a second pass through the content to explicitly look for hash tags and user references and put them in a separate field of your document (e.g. 'tags' and 'replies'). The analyzers for those fields then only return tokens for occurrences of #tag and @user respectively. – Thomas Apr 01 '10 at 09:53
  • The link is broken. You can now view the analyzers [here](https://github.com/apache/lucene-solr/tree/master/lucene/analysis). – NightOwl888 May 24 '17 at 06:12

It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter

This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):

<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
    <filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
  </analyzer>
</fieldType>
pintoch

There's a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java

dranxo

A tutorial on a Twitter-specific tokenizer, which is a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php. This API is capable of identifying emoticons, hashtags, interjections, etc. present in a tweet.

girip11

The Twitter API can be told to return all Tweets, bios, etc. with the "entities" (hashtags, user IDs, URLs, etc.) already parsed out of the content into collections.

https://dev.twitter.com/docs/entities

So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
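As a sketch of what that gives you (the field names follow the v1.1 entities documentation; the tweet text, screen name, and offsets here are made up for illustration), a tweet comes back with its entities pre-parsed:

```json
{
  "text": "Indexing tweets with #lucene, ping @amac",
  "entities": {
    "hashtags":      [{ "text": "lucene", "indices": [21, 28] }],
    "user_mentions": [{ "screen_name": "amac", "indices": [35, 40] }],
    "urls": []
  }
}
```

The indices are character offsets into the tweet text, so you can index the hashtags and mentions as separate fields without writing any tokenizer at all.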

ShakeyDave

Twitter has open-sourced their text-processing library, which implements token handlers for hashtags etc.

For example, the HashtagExtractor: https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java

It is based on Lucene's TokenStream.

jolestar