1

So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and separate it into [word, ',' (or '.', '#'), word].

But I am not able to create one that matches this pattern strictly and ignores everything else.

I used the following regex

r"[\w]+|[.]"

This one does well, but it doesn't do strict matching: if none of the characters (, # or .) occur in the text, it still gives me the words, which I don't want.
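
A minimal reproduction of the problem, for illustration only (the name `loose` is just a label chosen here, and the inputs are the sample strings from this question):

```python
import re

# The loose pattern from above: runs of word chars, or a literal dot
loose = re.compile(r"[\w]+|[.]")

print(loose.findall("no.16"))  # ['no', '.', '16']  (looks right)
print(loose.findall("word"))   # ['word']           (matches even without punctuation)
print(loose.findall("#400"))   # ['400']            (the '#' is silently dropped)
```

So the pattern neither requires one of the punctuation characters nor captures `#` or `,`.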

I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.

Please note: the words on either side of {,.#} do not both have to be present, but at least one of them must be.

Some example text for reference:

no.16         would give me ['no','.','16']
#400          would give me ['#','400']
word1.word2   would give me ['word1','.','word2']

Looking forward to some help and assistance from all regex gurus out there

EDIT:

I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall.

e.g. ONE TWO THREE #400 with viktor's regex gives me [('','#','400')]

but what was expected was ['ONE','TWO','THREE','#','400']

This can be done with NLTK or spaCy, but using those is not an option for me.

Aditya Vartak
  • What do you mean by "detect the pattern"? I.e. what would your expected outcome be for a string like "Some word#and another"? – buddemat Nov 09 '20 at 14:39
  • it would be a list of the separated entities of the string, based on the regex, after passing it through re.findall – Aditya Vartak Nov 10 '20 at 05:09

3 Answers

4

I suggest using

(\w+)?([.,#])((?(1)\w*|\w+))

See the regex demo.

Details

  • (\w+)? - An optional group #1: one or more word chars
  • ([.,#]) - Group #2: ., , or #
  • ((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).

See the Python demo:

import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
    print(s, ' -> ', pattern.findall(s))

Output:

no.16  ->  [('no', '.', '16')]
#400  ->  [('', '#', '400')]
word1.word2  ->  [('word1', '.', 'word2')]
word  ->  []
123  ->  []

The answer to your edit is

if re.search(r'\w[.,#]|[.,#]\w', text): 
    print( re.findall(r'[.,#]|[^\s.,#]+', text) )

If the input string contains a word char immediately followed by one of the three punctuation symbols, or one of those symbols immediately followed by a word char, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more chars other than whitespace, ., , and #.
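
As a rough sketch, the check and the extraction can be wrapped in one helper. The function name `split_tokens`, the constant names, and the fallback of returning the untouched input as a one-element list are assumptions made here (based on the "returns the whole word as it is" requirement in the question), not part of the snippet above:

```python
import re

# Trigger: a word char next to one of the punctuation chars, on either side
TRIGGER = re.compile(r'\w[.,#]|[.,#]\w')
# Token: a single punctuation char, or a run of chars that are
# neither whitespace nor one of the punctuation chars
TOKEN = re.compile(r'[.,#]|[^\s.,#]+')

def split_tokens(text):
    """Tokenize text if the trigger pattern occurs, else return it unchanged."""
    if TRIGGER.search(text):
        return TOKEN.findall(text)
    # Assumed fallback: hand back the whole input as a single item
    return [text]

print(split_tokens('ONE TWO THREE #400'))
# ['ONE', 'TWO', 'THREE', '#', '400']
print(split_tokens('192-168-1000 St no.15 on #400'))
# ['192-168-1000', 'St', 'no', '.', '15', 'on', '#', '400']
print(split_tokens('word'))
# ['word']
```

Because the token pattern only excludes whitespace and the three punctuation chars, hyphenated items such as 192-168-1000 stay in one piece.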

Wiktor Stribiżew
  • 2
    Initially I thought of using an [alternation](https://regex101.com/r/UUBLU4/1), but then you have all the empty strings in the result due to the groups at the left and right. This if clause brilliantly solves that ++ – The fourth bird Nov 09 '20 at 20:37
  • Thanks Viktor for the RE . Seems usable for my requirement. – Aditya Vartak Nov 10 '20 at 05:05
  • @viktor it does not return the words around the match, e.g. ONE TWO THREE #400 should have given `['ONE','TWO','THREE','#','400']` – Aditya Vartak Nov 10 '20 at 09:19
  • @AdityaVartak Do you mean you want to check if there is a `.`, `,`, or `#` with a word char on either end of the char in the string and then simply extract all words and `[.,#]`? Like `if re.search(r'\w[.,#]|[.,#]\w', text): print(re.findall(r'\w+|[.,#]', text))`? – Wiktor Stribiżew Nov 10 '20 at 09:22
  • Yes @viktor, you're right. I tried your suggested if statement and found that it splits words containing '-'. So `192-168-1000 st #400` becomes `['192','168','1000','st','#','400']`. What I needed was `['192-168-1000','st','#','400']`, meaning that all characters other than the ones we mentioned in the regex should stay in the form they are encountered, without being split – Aditya Vartak Nov 10 '20 at 09:36
  • @AdityaVartak `if re.search(r'\w[.,#]|[.,#]\w', text): print(re.split(r'\s*([.,#])\s*', text))`? – Wiktor Stribiżew Nov 10 '20 at 09:37
  • @viktor youre very close. Now using your latest if statement, for text `192-168-1000 St no.15 on #400` we get `['192-168-1000 St no', '.', '15 on ', '#', '400']` Notice that it is not splitting `'192-168-1000 St no'` into `['192-168-1000','St','no']` this change might be exactly the answer – Aditya Vartak Nov 10 '20 at 09:41
  • @AdityaVartak Try `re.findall(r'[.,#]|[^\s.,#]+', text)` – Wiktor Stribiżew Nov 10 '20 at 09:46
  • @viktor Thanks for your prompt replies! This works exactly as i intended to! I cant thank you enough – Aditya Vartak Nov 10 '20 at 09:51
1

I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:

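import re

a = 'no.16'
b = '#400'
c = 'word1.word2'

lst = [a, b, c]

for elem in lst:
    # The capturing group keeps the delimiter in the split result
    result = re.split(r'([.,#])', elem)
    # Drop the empty strings that re.split produces at the edges
    result = [part for part in result if part]
    print(result)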

Qamar Abbas
  • I think this won't work in case there are multiple expected matches. Try with `a='ww.www no.16'`. We could ask OP though. – Ryszard Czech Nov 09 '20 at 21:14
  • Thanks for your answer! this can work . But i was looking for something like what @Viktor posted . though this one can be made more efficient by using regex from viktor. thanks for the answer – Aditya Vartak Nov 10 '20 at 05:04
  • @RyszardCzech i have checked its working on a='ww.www no.16' , the output in this case is ['ww', '.', 'www no', '.', '16'] – Qamar Abbas Nov 10 '20 at 09:06
  • @QamarAbbas Yes, sort of works, but OP has already mentioned it is not what they want. – Ryszard Czech Nov 10 '20 at 20:13
0

You could do something like this:

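import re

text = "no.16"

# Use [.,#] here: inside a character class, "|" is a literal pipe, not alternation
pattern = re.compile(r"(\w+)([.,#])(\w+)")

result = list(filter(None, pattern.split(text)))
print(result)  # ['no', '.', '16']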

The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).

However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.
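
To illustrate that caveat, here is a small sketch. The helper name `split_clean` and the second input string are made up for this example, and the character class is written as `[.,#]` so that all three delimiters from the question are covered:

```python
import re

# Word, one of the three delimiters, word (each in its own capturing group)
pattern = re.compile(r"(\w+)([.,#])(\w+)")

def split_clean(text):
    """Split on the pattern and drop the empty strings re.split leaves behind."""
    return [part for part in pattern.split(text) if part]

print(split_clean("no.16"))          # ['no', '.', '16']
print(split_clean("foo no.16 bar"))  # ['foo ', 'no', '.', '16', ' bar']
```

In the second call the surrounding text comes back as 'foo ' and ' bar', whitespace included, so this approach only gives clean results when the string consists of nothing but the pattern.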

buddemat