I'm processing text from Common Crawl (the WET format), and from what I see there's a lot of broken punctuation, most likely caused by line breaks being removed from the original data.
For example, in "This Massive Rally?The 52", the question mark and "The" should be separated by a space. I try to fix this problem with the following regexp (in Java):
line.replaceAll("([.;:,!?)])([A-Z])", "$1 $2");
While it handles most cases properly, it also adds spaces in places where it shouldn't, e.g. "U.S." becomes "U. S." and "www.HiringJobTweets.com" becomes "www. HiringJobTweets.com".
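To make this easy to reproduce, here is a minimal sketch that applies the same regexp to the three strings above. The class name `PunctuationSpacingDemo` and the helper `fix` are just names I made up for this illustration; only the pattern and replacement come from my actual code.

```java
// Minimal reproduction; PunctuationSpacingDemo and fix() are illustrative names.
class PunctuationSpacingDemo {

    // Inserts a space between a punctuation mark and an immediately
    // following uppercase ASCII letter (same regexp as in the question).
    static String fix(String line) {
        return line.replaceAll("([.;:,!?)])([A-Z])", "$1 $2");
    }

    public static void main(String[] args) {
        // Desired behavior: the missing space is restored.
        System.out.println(fix("This Massive Rally?The 52"));
        // prints: This Massive Rally? The 52

        // Undesired behavior: abbreviations and host names get split.
        System.out.println(fix("U.S."));
        // prints: U. S.
        System.out.println(fix("www.HiringJobTweets.com"));
        // prints: www. HiringJobTweets.com
    }
}
```

The first call shows the fix I want; the other two show the side effects I'm trying to avoid.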
Is there a way to solve the problem while avoiding the undesired side-effects?