I'm using regex in Python and trying to extract Hindi words from a given string and print them, but I can't get it to work. I want to extract जनवरी12 and जनवरी22 from the string. The pattern should find a phrase that starts with जनवरी (or any Hindi characters) and ends with 12 (or any number). Here is the code:

import re

string = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
mo = re.compile(r'[^(^a-zA-Z-0-9)]+\d+')
print(mo.findall(string))

Output: [' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']

I know that [^abc] matches any character that isn’t between the brackets and tried to achieve the same with [^(^a-zA-Z-0-9)]+ but the output is not what I expected.

Expected output: जनवरी12, जनवरी22

Can anyone explain how this should be done, and how to match the start and end in Python's regex?

  • I think you also need to exclude whitespace chars: `[^a-zA-Z-0-9\s]+\d+` https://regex101.com/r/UiyGii/1 – The fourth bird Feb 07 '20 at 10:23
  • @Thefourthbird That fixed it. Should I be using a `$` at the end to match the trailing number? If so, how? – Vipul Priyadarshi Feb 07 '20 at 10:26
  • But `[^a-zA-Z-0-9\s]` matches any char but ASCII alphanumeric chars or whitespace. Don't you want to match any 1+ letters other than ASCII and then 1+ digits? – Wiktor Stribiżew Feb 07 '20 at 10:27
  • @WiktorStribiżew I just want to match Hindi characters followed by a digit. This code actually matches other languages too, such as Urdu. Is there a built-in module for matching only Hindi characters? – Vipul Priyadarshi Feb 07 '20 at 10:30
  • If you use a `$` at the end of the pattern, you will get a single match because it asserts the end of the string. There is also a hyphen in the middle of the character class, which I assume can be omitted. If there should be no partial matches, you could use `[^a-zA-Z0-9\s]+\d+\b` or `[^a-zA-Z0-9\s]+\d+(?!\S)` – The fourth bird Feb 07 '20 at 10:31
  • @Thefourthbird As you said, I always had a single match while using $. [^a-zA-Z0-9\s]+\d+(?!\S) this works for me. Can you explain this part \d+(?!\S) ? – Vipul Priyadarshi Feb 07 '20 at 10:35
  • `\d+(?!\S)` matches 1 or more digits and then uses a negative lookahead to assert that what is on the right is not a non-whitespace char, i.e. the digits must be followed by whitespace or the end of the string. – The fourth bird Feb 07 '20 at 10:45
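
To make the comment thread concrete, here is a minimal sketch that runs the three patterns discussed above against the string from the question. The variable names (`original`, `anchored`, `bounded`) are only for illustration and do not come from the thread.

```python
import re

text = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"

# Original class: inside [...], "(", ")" and the second "^" are literal
# characters, and whitespace is not excluded, so spaces are matched too.
original = re.findall(r'[^(^a-zA-Z-0-9)]+\d+', text)

# With $ the match must end at the end of the string, so only the
# last candidate can match.
anchored = re.findall(r'[^a-zA-Z0-9\s]+\d+$', text)

# (?!\S) is a negative lookahead: the digits must be followed by
# whitespace or by the end of the string.
bounded = re.findall(r'[^a-zA-Z0-9\s]+\d+(?!\S)', text)

print(original)  # [' 12', 'वें संस्करण जनवरी12', ' 12', ' जनवरी22']
print(anchored)  # ['जनवरी22']
print(bounded)   # ['जनवरी12', 'जनवरी22']
```

Note that `[^a-zA-Z0-9\s]` still accepts punctuation and letters of any non-ASCII script, so this only narrows the match; it does not restrict it to Hindi.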

1 Answer

I think you just need a pattern that matches 1+ letters (with 0 or more diacritics after each) and then 1+ digits.

See a Python demo that outputs ['जनवरी12', 'जनवरी22']:

import re
s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
combining_marks = '[\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u07FD\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08D3-\u08E1\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u09FE\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AFA-\u0AFF\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C04\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D00-\u0D03\u0D3B\u0D3C\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u1885\u1886\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF4\u1CF7-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C5\uA8E0-\uA8F1\uA8FF\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F\U000101FD\U000102E0\U00010376-\U0001037A\U00010A01-\U00010A03\U00010A05\U00010A06\U00010A0C-\U00010A0F\U00010A38-\U00010A3A\U00010A3F\U00010AE5\U00010AE6\U00010D24-\U00010D27\U00010F46-\U00010F50\U00011000-\U00011002\U00011038-\U00011046\U0001107F-\U00011082\U000110B0-\U000110BA\U00011100-\U00011102\U00011127-\U00011134\U00011145\U00011146\U00011173\U00011180-\U00011182\U000111B3-\U000111C0\U000111C9-\U000111CC\U0001122C-\U00011237\U0001123E\U000112DF-\U000112EA\U00011300-\U00011303\U0001133B\U0001133C\U0001133E-\U00011344\U00011347\U00011348\U0001134B-\U0001134D\U00011357\U00011362\U00011363\U00011366-\U0001136C\U00011370-\U00011374\U00011435-\U00011446\U0001145E\U000114B0-\U000114C3\U000115AF-\U000115B5\U000115B8-\U000115C0\U000115DC\U000115DD\U00011630-\U00011640\U000116AB-\U000116B7\U0001171D-\U0001172B\U0001182C-\U0001183A\U000119D1-\U000119D7\U000119DA-\U000119E0\U000119E4\U00011A01-\U00011A0A\U00011A33-\U00011A39\U00011A3B-\U00011A3E\U00011A47\U00011A51-\U00011A5B\U00011A8A-\U00011A99\U00011C2F-\U00011C36\U00011C38-\U00011C3F\U00011C92-\U00011CA7\U00011CA9-\U00011CB6\U00011D31-\U00011D36\U00011D3A\U00011D3C\U00011D3D\U00011D3F-\U00011D45\U00011D47\U00011D8A-\U00011D8E\U00011D90\U00011D91\U00011D93-\U00011D97\U00011EF3-\U00011EF6\U00016AF0-\U00016AF4\U00016B30-\U00016B36\U00016F4F\U00016F51-\U00016F87\U00016F8F-\U00016F92\U0001BC9D\U0001BC9E\U0001D165-\U0001D169\U0001D16D-\U0001D172\U0001D17B-\U0001D182\U0001D185-\U0001D18B\U0001D1AA-\U0001D1AD\U0001D242-\U0001D244\U0001DA00-\U0001DA36\U0001DA3B-\U0001DA6C\U0001DA75\U0001DA84\U0001DA9B-\U0001DA9F\U0001DAA1-\U0001DAAF\U0001E000-\U0001E006\U0001E008-\U0001E018\U0001E01B-\U0001E021\U0001E023\U0001E024\U0001E026-\U0001E02A\U0001E130-\U0001E136\U0001E2EC-\U0001E2EF\U0001E8D0-\U0001E8D6\U0001E944-\U0001E94A\U000E0100-\U000E01EF]'
mo = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))
print(mo.findall(s))

Note that r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks) creates a pattern that matches

  • (?:[^\W\d_]{}*)+ - one or more occurrences of
    • [^\W\d_] - any Unicode base letter (if you want to disallow ASCII letters, add (?![A-Za-z]) right before this pattern)
    • {}* - zero or more occurrences of combining_marks
  • \d+ - 1+ digits
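
If maintaining the long hard-coded list is a concern, one possible alternative (not part of the answer above, just a sketch) is to build the mark class at runtime from the standard `unicodedata` module, collecting every code point whose general category starts with `M`. The exact set then depends on the Unicode version bundled with your Python, so it may differ slightly from the list above.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is a Mark (Mn, Mc or Me).
# This scans the whole code space once, so build it at import time and reuse it.
marks = ''.join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('M')
)

# re.escape is a precaution only; marks are all non-ASCII characters.
combining_marks = '[' + re.escape(marks) + ']'

# Same pattern shape as in the answer: letters, each optionally followed
# by marks, and then one or more digits.
pattern = re.compile(r'(?:[^\W\d_]{}*)+\d+'.format(combining_marks))

s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
print(pattern.findall(s))  # ['जनवरी12', 'जनवरी22']
```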

So, if you want to avoid matching ASCII letters, in the above code, use

r'(?:(?![A-Za-z])[^\W\d_]{}*)+\d+'
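
If the goal is to accept only Hindi text rather than any non-ASCII letters, a simpler option may be to restrict the match to the Devanagari Unicode block (U+0900 to U+097F). The sketch below is an assumption about what "Hindi characters" should mean here: it limits the match by script, not by language, so other languages written in Devanagari (Marathi or Nepali, for instance) match as well. The `devanagari` name and the `mixed` sample string are made up for illustration.

```python
import re

# Devanagari block minus the danda signs (U+0964, U+0965), the Devanagari
# digits (U+0966-U+096F) and the abbreviation sign (U+0970), so that only
# letters and their dependent signs remain in the class.
devanagari = r'[\u0900-\u0963\u0971-\u097F]'
pattern = re.compile(devanagari + r'+\d+')

s = "विश्व कप sdsd 12वें संस्करण जनवरी12 or 12जनवरी or जनवरी22"
print(pattern.findall(s))  # ['जनवरी12', 'जनवरी22']

# An Arabic-script word and an ASCII word followed by digits are skipped.
mixed = "\u062c\u0646\u0648\u0631\u06cc12 abc12 जनवरी12"
print(pattern.findall(mixed))  # ['जनवरी12']
```

Because the class already contains the Devanagari vowel signs and the virama, no separate combining-mark list is needed for this narrower case.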
Wiktor Stribiżew
  • That is impressive ++ How or where do you get those ranges? – The fourth bird Feb 07 '20 at 10:41
  • @Thefourthbird It is no secret, these codes can be obtained at [Unicode Utilities: UnicodeSet](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BM%7D&abb=on&ucd=on&esc=on&g=&i=). Actually, I have updated the code above since Python can deal with `\UXXXXXXXX` code points in regex patterns. – Wiktor Stribiżew Feb 07 '20 at 10:46