3

Suppose I have a string such as this:

"IgotthistextfromapdfIscraped.HowdoIsplitthis?"

And I want to produce:

"I got this text from a pdf I scraped. How do I split this?"

How can I do it?

2 Answers

4

It turns out that this task is called word segmentation, and there is a Python library, wordsegment, that can do it:

>>> from wordsegment import load, segment
>>> load()
>>> segment("IgotthistextfromapdfIscraped.HowdoIsplitthis?")
['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped', 'how',
 'do', 'i', 'split', 'this']
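
As the output shows, the result is lowercased and the punctuation is gone. If you need the original casing and punctuation back, one option is to map the words back onto the input string. This is only a minimal sketch: `restore_surface_form` is a hypothetical helper, not part of wordsegment, and it assumes the returned words, joined together, equal the letters and digits of the input in order (plain ASCII text).

```python
def restore_surface_form(original, words):
    """Re-apply the casing and punctuation of `original` to segmented words.

    Assumption: "".join(words) equals the lowercased letters and digits
    of `original`, in the same order.
    """
    pieces = []
    pos = 0  # current index into `original`
    for word in words:
        chunk = []
        remaining = len(word)
        # Consume as many letters/digits from the original as the word has.
        while remaining and pos < len(original):
            ch = original[pos]
            chunk.append(ch)
            if ch.isalnum():
                remaining -= 1
            pos += 1
        # Keep trailing punctuation attached to the word it follows.
        while pos < len(original) and not original[pos].isalnum():
            chunk.append(original[pos])
            pos += 1
        pieces.append("".join(chunk))
    return " ".join(pieces)


text = "IgotthistextfromapdfIscraped.HowdoIsplitthis?"
words = ['i', 'got', 'this', 'text', 'from', 'a', 'pdf', 'i', 'scraped',
         'how', 'do', 'i', 'split', 'this']
print(restore_surface_form(text, words))
```

Here `words` is simply the list shown above, so the sketch runs without the library installed.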

2

Short answer: no realistic chance.

Long answer:

The only hint about where to split the string is the presence of valid words in it. So you need a dictionary of the expected language, containing not only the root words but also all their inflected forms (inflections, in linguistic terms). Then you can try to find a sequence of these words that matches the characters of your string.
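
To make the dictionary idea concrete, here is a rough sketch of such a search using memoized backtracking. The function name `split_into_words` and the tiny vocabulary are made up for illustration; a real word list would have to be far larger, and the input is assumed to contain letters only (no punctuation).

```python
from functools import lru_cache


def split_into_words(text, vocabulary):
    """Return one list of vocabulary words that spells `text`, or None."""
    vocab = frozenset(word.lower() for word in vocabulary)
    max_len = max(map(len, vocab), default=0)
    lowered = text.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Reached the end of the string: nothing left to split.
        if start == len(lowered):
            return ()
        # Try longer candidate words first, then fall back to shorter ones.
        for end in range(min(len(lowered), start + max_len), start, -1):
            candidate = lowered[start:end]
            if candidate in vocab:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # no vocabulary word fits at this position

    result = solve(0)
    return None if result is None else list(result)


vocabulary = {"i", "got", "this", "text", "from", "a", "pdf", "scraped"}
print(split_into_words("IgotthistextfromapdfIscraped", vocabulary))
```

This returns the first split it finds, not necessarily the intended one. When several splits are valid, a plain dictionary cannot choose between them, which is where ranking candidates by word frequency helps. The recursion is also only suitable for short strings.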

Ralf Kleberhoff