I'm thinking of using Lucene's StandardTokenizer for word tokenization in a non-IR context.
As I understand it, this tokenizer discards punctuation characters. Does anybody know of a way (or have experience with) making it emit punctuation characters as separate tokens instead?
Example of current behaviour:
Welcome, Dr. Chasuble! => Welcome Dr. Chasuble
Example of desired behaviour:
Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !
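
To pin down the desired output precisely, here is a minimal plain-Java sketch of the behaviour I'm after. It does not use Lucene at all; the class name, the `tokenize` helper, and the tiny abbreviation set are all hypothetical stand-ins for illustration, and a real solution would presumably need something more robust than a hard-coded list to keep "Dr." intact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PunctuationTokens {
    // Hypothetical abbreviation list (assumption); these keep their trailing period.
    private static final Set<String> ABBREVIATIONS = Set.of("Dr.", "Mr.", "Mrs.");

    // Splits on whitespace, then emits every non-alphanumeric character
    // as its own token, except inside known abbreviations.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // input was empty or whitespace only
            }
            if (ABBREVIATIONS.contains(chunk)) {
                tokens.add(chunk);
                continue;
            }
            int start = 0;
            for (int i = 0; i < chunk.length(); i++) {
                char c = chunk.charAt(i);
                if (!Character.isLetterOrDigit(c)) {
                    if (i > start) {
                        tokens.add(chunk.substring(start, i)); // word before the punctuation
                    }
                    tokens.add(String.valueOf(c)); // the punctuation character itself
                    start = i + 1;
                }
            }
            if (start < chunk.length()) {
                tokens.add(chunk.substring(start)); // trailing word, if any
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: Welcome , Dr. Chasuble !
        System.out.println(String.join(" ", tokenize("Welcome, Dr. Chasuble!")));
    }
}
```

Ideally I'd get the same token stream out of Lucene's analysis chain rather than maintaining something like this by hand.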