Fixing broken punctuation in CommonCrawl Text

Asked Oct 08 '15 at 12:50

Active Oct 08 '15 at 13:02

Viewed 46 times

I'm processing text from Common Crawl (the WET format), and from what I see there is a lot of broken punctuation, most likely introduced when line breaks were removed from the original data.

For example, in "This Massive Rally?The 52", the question mark and "The" should be separated by a space. I tried to fix this with the following regexp (in Java):

line = line.replaceAll("([.;:,!?)])([A-Z])", "$1 $2");

While it handles most cases properly, it also adds spaces where it shouldn't: "U.S." becomes "U. S." and "www.HiringJobTweets.com" becomes "www. HiringJobTweets.com".

Is there a way to solve the problem while avoiding the undesired side-effects?
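
One heuristic that sidesteps these two particular side-effects can be sketched as follows. It is only a sketch under assumptions, not a general solution: it treats any whitespace-delimited token containing "://", "www." or a common TLD as a URL and leaves it untouched, and it never splits after a period that follows a single capital letter (as in "U.S."). The class name PunctuationFixer, the method fix and the TLD list are made up for illustration. A sentence that really ends in a single capital letter ("Plan B.The next") will be missed by this rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PunctuationFixer {
    // Whitespace-delimited tokens; each one is examined on its own.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Assumption: a token containing "://", "www." or one of these TLDs is a URL.
    private static final Pattern URL_LIKE = Pattern.compile(
            "://|www\\.|\\.(?:com|org|net|edu|gov)\\b", Pattern.CASE_INSENSITIVE);

    // Punctuation directly followed by an uppercase letter. A period is skipped
    // when it follows a single capital letter, which is treated as an initialism.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?:[;:,!?)]|(?<!\\b[A-Z])\\.)(?=[A-Z])");

    static String fix(String line) {
        Matcher tokens = TOKEN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (tokens.find()) {
            String token = tokens.group();
            // URL-like tokens are copied unchanged; others get a space inserted
            // after each matched punctuation mark ($0 is the whole match).
            String fixed = URL_LIKE.matcher(token).find()
                    ? token
                    : MISSING_SPACE.matcher(token).replaceAll("$0 ");
            tokens.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        tokens.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("This Massive Rally?The 52"));
        System.out.println(fix("U.S. stocks"));
        System.out.println(fix("www.HiringJobTweets.com"));
    }
}
```

Working token by token keeps the URL check local, so one URL in a line does not prevent fixes elsewhere in the same line. The trade-off is that a URL glued to the following sentence ("example.com.Then") is left as it is.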

edited Oct 08 '15 at 13:02

asked Oct 08 '15 at 12:50

Alexey Grigorev


No general solution, because of the loss of information when linebreaks are simply removed. Any way to keep that from happening? Or at least change linebreaks to space instead of simple removal? – Jeff Y Oct 08 '15 at 13:13
Well, there is the raw crawl data with all the original HTML, which could be used to recover this structure. Unfortunately, the raw data needs additional processing (such as parsing), and that can become too expensive for the entire dataset – Alexey Grigorev Oct 08 '15 at 13:50

0 Answers