
I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.

I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?

Example of current behaviour:

Welcome, Dr. Chasuble! => Welcome Dr. Chasuble

Example of desired behaviour:

Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
sam
    You can use the Whitespace tokenizer, followed by a pattern tokenizer which splits on word boundaries i.e. the regex `\b` – arun Feb 05 '15 at 20:01
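The suggestion in the comment above can be sketched without Lucene at all: a plain `String.split` on the regex `\b` already separates words from punctuation. This is a minimal sketch, not part of the original discussion; note that, like any purely boundary-based split, it also breaks "Dr." into "Dr" and "." — the caveat the accepted answer discusses.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryTokenizer {
    // Split on word boundaries (\b); the zero-width matches leave the
    // punctuation characters intact as their own pieces. Whitespace-only
    // pieces are dropped so only words and punctuation survive.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\b")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // → Welcome | , | Dr | . | Chasuble | !
        System.out.println(String.join(" | ", tokenize("Welcome, Dr. Chasuble!")));
    }
}
```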

2 Answers


Generally, for custom tokenization of both IR and non-IR content, it is a good idea to use ICU (ICU4J is the Java version). This is a good place to start: http://userguide.icu-project.org/boundaryanalysis

The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator, or implement your own heuristic — either in your code, or by creating your own iterator, which in ICU can be defined as a file containing a number of regex-style rules.
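To sketch the boundary-analysis approach without pulling in the ICU4J dependency, the JDK's built-in `java.text.BreakIterator` exposes the same word-boundary API (it is itself derived from ICU). This is an illustrative sketch, not the answerer's code; as described above, the default rules do split "Dr." into "Dr" and ".", which is exactly the case a dictionary-based or custom iterator would have to handle.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {
    // Walk the word boundaries reported by BreakIterator and collect
    // every non-whitespace segment, so punctuation comes out as tokens.
    public static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", tokens("Welcome, Dr. Chasuble!")));
    }
}
```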

Alex Nevidomsky
    Thanks a lot for your pointers and suggestions. I ended up leveraging Apache Lucene's [UAX29URLEmailTokenizer](https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_10_2/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex) JFlex grammar, as suggested in [this](http://stackoverflow.com/questions/7846305/generating-a-custom-tokenizer-for-new-tokenstream-api-using-jflex-java-cc) thread. – sam Feb 17 '15 at 15:17

You could also consider using a tokenization tool from the NLP community instead. Issues like this are usually well taken care of there.

One off-the-shelf tool is Stanford CoreNLP (it also offers its tokenizer as an individual component). UIUC's pipeline should handle this elegantly as well: http://cogcomp.cs.illinois.edu/page/software/

yiping