
I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.

I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?

Example of current behaviour:

Welcome, Dr. Chasuble! => Welcome Dr. Chasuble

Example of desired behaviour:

Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
sam
    You can use the Whitespace tokenizer, followed by a pattern tokenizer which splits on word boundaries i.e. the regex `\b` – arun Feb 05 '15 at 20:01
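The suggestion in the comment above can be sketched without Lucene at all: a plain `String.split` on the regex `\b` already separates words from punctuation. This is a minimal sketch, not part of the original discussion; note that, like any purely boundary-based split, it also breaks "Dr." into "Dr" and "." — the caveat the accepted answer discusses.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryTokenizer {
    // Split on word boundaries (\b); the zero-width matches leave the
    // punctuation characters intact as their own pieces. Whitespace-only
    // pieces are dropped so only words and punctuation survive.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\b")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // → Welcome | , | Dr | . | Chasuble | !
        System.out.println(String.join(" | ", tokenize("Welcome, Dr. Chasuble!")));
    }
}
```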

2 Answers


Generally, for custom tokenization of both IR and non-IR content, it is a good idea to use ICU (ICU4J is the Java version). This is a good place to start: http://userguide.icu-project.org/boundaryanalysis

The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator, or implement your own heuristic — either in your code, or by creating your own iterator, which in ICU can be defined as a file containing a number of regex-style rules.
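To sketch the boundary-analysis approach without pulling in the ICU4J dependency, the JDK's built-in `java.text.BreakIterator` exposes the same word-boundary API (it is itself derived from ICU). This is an illustrative sketch, not the answerer's code; as described above, the default rules do split "Dr." into "Dr" and ".", which is exactly the case a dictionary-based or custom iterator would have to handle.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {
    // Walk the word boundaries reported by BreakIterator and collect
    // every non-whitespace segment, so punctuation comes out as tokens.
    public static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", tokens("Welcome, Dr. Chasuble!")));
    }
}
```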

Alex Nevidomsky
    Thanks a lot for your pointers and suggestions. I ended up leveraging Apache Lucene's [UAX29URLEmailTokenizer](https://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_10_2/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerImpl.jflex) JFlex grammar, as suggested in [this](http://stackoverflow.com/questions/7846305/generating-a-custom-tokenizer-for-new-tokenstream-api-using-jflex-java-cc) thread. – sam Feb 17 '15 at 15:17

You could also consider using a tokenization tool from the NLP community instead. Issues like this are usually well taken care of there.

One off-the-shelf tool is Stanford CoreNLP (it also offers its tokenizer as an individual component). UIUC's pipeline should handle this elegantly as well: http://cogcomp.cs.illinois.edu/page/software/

yiping