
I am using spaCy's NLP model to work out the POS of input data so that my Markov chains can be a bit more grammatically correct, as with the example in the Python markovify library found here. However, the way that spaCy splits tokens makes reconstructing them difficult, because certain grammatical elements are also split up; for example, "don't" becomes ["do", "n't"]. This means that you can no longer rejoin generated Markov chains simply by inserting spaces, but need to know whether adjacent tokens make up one word.

I assumed that the is_left_punct and is_right_punct properties of tokens might relate to this, but they don't seem to. My current code only accounts for PUNCT tokens, so the "do n't" problem persists.

Is there a token property I can use to tell the method that joins sentences together when to omit spaces, or is there some other way to know this?

Auh

1 Answer


spaCy tokens have a whitespace_ attribute, which is always set.

It holds the whitespace that followed the token in the original text: a space if one was present, or an empty string if not.

The empty string occurs in cases like the one you mentioned, where tokenisation splits a continuous string.

So for "don't", the whitespace_ of the "do" token will be the empty string.

For example

[bool(token.whitespace_) for token in nlp("don't")]

Should produce

[False, False]
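To show how this helps when rejoining, here is a minimal sketch that concatenates each token's text with its trailing whitespace instead of joining on spaces. The helper name join_tokens is hypothetical, and the (text, whitespace) pairs are hard-coded as an assumed tokenisation so the example runs without a spaCy model; with spaCy they would come from token.text and token.whitespace_.

```python
def join_tokens(pairs):
    """Rebuild text from (text, trailing_whitespace) pairs."""
    return "".join(text + ws for text, ws in pairs)


# With spaCy loaded, the pairs could be built like this:
#   pairs = [(t.text, t.whitespace_) for t in nlp("I don't know.")]
# They are hard-coded here (assumed tokenisation) to keep the sketch standalone.
pairs = [("I", " "), ("do", ""), ("n't", " "), ("know", ""), (".", "")]

print(join_tokens(pairs))  # I don't know.
```

spaCy tokens also expose text_with_ws, which is the token text plus its trailing whitespace, so joining those strings gives the same result.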
Nathan McCoy
  • Thanks very much. Just a note for whoever uses this. It's more useful to know if a space goes before the word rather than after so after I `nlp` a sentence I shift all the `whitespace_` attributes to the left for generating markov chains. Thanks again. – Auh Apr 03 '19 at 21:08
  • @Auh — I'm struggling with implementing your method. Would you mind posting a simple example (assuming spaCy/markovify)? – lightyrs May 12 '19 at 00:58
  • I'll try to help. Can you explain exactly what you want? – Nathan McCoy May 12 '19 at 20:45
  • I'm doing this for training: `str(bool(word.whitespace_))))` and this for generation: `if whitespace == "True": sentence += f"{word} "` Does this make sense? Does the whitespace property need to be converted to string to train the markov model? @NathanMcCoy – lightyrs May 18 '19 at 19:43
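
The shifting idea from the comments can be sketched roughly as follows. This is one possible reading, not Auh's actual code: the helper names and the "::" state encoding are hypothetical, the input pairs are hard-coded rather than produced by spaCy, and it assumes the Markov model stores each state as a plain string, which is why the flag is written into the string as "1" or "0".

```python
def shift_to_leading(tokens):
    """Turn (text, has_trailing_space) pairs into (text, has_leading_space) pairs."""
    texts = [text for text, _ in tokens]
    trailing = [flag for _, flag in tokens]
    # Each token's leading flag is the previous token's trailing flag;
    # the first token never needs a leading space.
    leading = [False] + trailing[:-1]
    return list(zip(texts, leading))


def encode(text, leading):
    """Hypothetical state encoding: the word, then '::', then '1' or '0'."""
    return f"{text}::{int(leading)}"


def join_states(states):
    """Rebuild a sentence from encoded states, adding a space only where flagged."""
    out = []
    for state in states:
        text, flag = state.rsplit("::", 1)
        # Skip the space at the very start of the sentence.
        out.append((" " if flag == "1" and out else "") + text)
    return "".join(out)


# With spaCy: tokens = [(t.text, bool(t.whitespace_)) for t in nlp("I don't know.")]
tokens = [("I", True), ("do", False), ("n't", True), ("know", False), (".", False)]
states = [encode(text, lead) for text, lead in shift_to_leading(tokens)]

print(states)               # ['I::0', 'do::1', "n't::0", 'know::1', '.::0']
print(join_states(states))  # I don't know.
```

Storing the flag on the token it precedes means a generated word carries its own spacing, so the joiner does not have to look at the neighbouring state.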