I have been trying to construct a regex that detects the pattern {word}{.,#}{word}
and separates it into [word, ',' (or '.', '#'), word].
But I am not able to create one that matches this pattern strictly and ignores everything else.
I used the following regex:
r"[\w]+|[.]"
This one works well, but it doesn't match strictly: if none of the ',', '#' or '.' characters occur in the text, it still gives me the words, which I don't want.
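To make the problem concrete, here is a minimal reproduction with the regex above. The sample strings are only illustrations (the third one is made up for this example):

```python
import re

# The pattern from the question: runs of word characters, or a literal dot.
pattern = re.compile(r"[\w]+|[.]")

samples = ["no.16", "#400", "plain words"]
results = {text: pattern.findall(text) for text in samples}

for text, tokens in results.items():
    print(f"{text!r} -> {tokens}")
```

The words in "plain words" are returned even though no separator is present, and the '#' in "#400" is dropped because the character class only contains the dot.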
I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.
Please note: the words on either side of the {,.#} separator do not both have to be present, but at least one of them must be.
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#','400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall.
E.g. ONE TWO THREE #400
with viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#','400'].
This can be done with NLTK or spaCy, but using those libraries is not an option here.
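For reference, this is a rough sketch of the behaviour I am after, using only the standard re module. The pattern and the name split_tokens are hypothetical and not taken from any answer. It assumes a separator should be kept only when a word character sits directly next to it, which is how I read the "at least one word" rule above:

```python
import re

# Hypothetical pattern (an assumption, not an established solution):
#   \w+            a run of word characters
#   [.,#](?=\w)    a separator directly followed by a word character
#   (?<=\w)[.,#]   a separator directly preceded by a word character
TOKEN_RE = re.compile(r"\w+|[.,#](?=\w)|(?<=\w)[.,#]")


def split_tokens(text):
    """Return words and word-adjacent separators in order of appearance."""
    return TOKEN_RE.findall(text)


print(split_tokens("ONE TWO THREE #400"))
```

With this pattern a separator surrounded by whitespace on both sides (as in "a , b") is skipped, while the plain words are still returned.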