I have some text that was generated by another system. It ran some words together, which I assume was a by-product of some sort of word wrapping. So something simple like 'the dog' is combined into 'thedog'.
I checked the ASCII and Unicode strings to see whether there was some unseen character in there, but there wasn't. A complicating factor is that this is medical text, and corpora to check against are not readily available. A real example: '...test to rule out SARS versus pneumonia' ends up as '... versuspneumonia.'
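For anyone who wants to repeat the hidden-character check, here is a minimal sketch of that kind of test, assuming Python and its standard `unicodedata` module. The helper name `find_hidden_chars` is hypothetical and only for illustration; it is not necessarily how the check was originally done.

```python
import unicodedata

def find_hidden_chars(text):
    """Return (index, code point, name) for characters that are easy to miss by eye."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Cc = control characters (this includes tab and newline),
        # Cf = format characters such as zero-width space or soft hyphen,
        # Zl/Zp = line and paragraph separators,
        # Zs = space separators; the ordinary ASCII space is deliberately skipped.
        if cat in ("Cc", "Cf", "Zl", "Zp") or (cat == "Zs" and ch != " "):
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits

# A fused word with nothing hidden in it yields an empty list.
print(find_hidden_chars("versuspneumonia"))
# A zero-width space between the two words would be reported.
print(find_hidden_chars("versus\u200bpneumonia"))
```

On my data a check like this comes back empty for the fused words, so the separator really is gone rather than just invisible.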
Anyone have a suggestion for finding and separating these?