identifying if the character is a digit or Unicode character within a word in python

Question

I want to find out whether a word contains both digits and characters and, if so, separate the digit part from the character part. I want to check Tamil words, e.g. ரூ.100 or ரூ100: I want to separate ரூ. and 100, and ரூ and 100. How do I do it in Python? I tried this:

    for word in f.read().strip().split():
        for word1, word2, word3 in zip(word, word[1:], word[2:]):
            if word1 == "ர" and word2 == "ூ " and word3.isdigit():
                print word1
                print word2
            if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2):
                print word1
                print word2

I tried checking whether the first character is ரூ and whether it is followed by a digit, but the problem was that I could not match against the Unicode value; it throws an error. – charvi, Mar 30 '14 at 07:21
This is what I tried: the same loop as in the question, but with the "and word3.isdigit()" check commented out in the first condition. – charvi, Mar 30 '14 at 07:22
@Ismail Badawi But I also want to separate 100 and ஆம் in words like 100ஆம், so I thought the above code would not be generic anyway and left it out. – charvi, Mar 30 '14 at 07:25
@smci: From next time I will post my code in the question. Thank you. – charvi, Mar 31 '14 at 16:38
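
One likely reason the loop fails, assuming the file is UTF-8 and is read as a byte string (the Python 2 default for a plain open): iterating over a byte string yields single bytes, while each Tamil character takes three bytes, so a lone byte never equals "ர" and cannot be decoded by itself. A minimal Python 3 sketch of that difference follows; the sample word and the variable names are chosen here for illustration only.

```python
# Assumption: the word arrives as UTF-8 encoded bytes, e.g. from a file
# opened in binary mode. The sample word is ரூ100, written with escapes.
raw = "\u0bb0\u0bc2100".encode("utf-8")

# Decoding a single byte of a multi-byte character fails.
try:
    raw[:1].decode("utf-8")
    single_byte_decodes = True
except UnicodeDecodeError:
    single_byte_decodes = False

# Decode the whole word first, then work with code points.
text = raw.decode("utf-8")
code_points = [hex(ord(ch)) for ch in text]

starts_with_ru = text.startswith("\u0bb0\u0bc2")
digit_follows = len(text) > 2 and text[2].isdigit()

print(len(raw), len(text))  # byte count vs. code point count
print(code_points)
print(single_byte_decodes, starts_with_ru, digit_follows)
```

Once the whole word is decoded, ரூ is two code points (U+0BB0 followed by U+0BC2), which is what the unichr(0xbb0) and unichr(0xbc2) comparisons were aiming at.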

alecxe · Accepted Answer · 2014-03-30T07:31:29.830

4

You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the digits, the digits themselves, and everything after them:

>>> import re
>>> pattern = ur'(.*?)(\d+)(.*)'
>>> s = u"ரூ.100"
>>> match = re.match(pattern, s, re.UNICODE)
>>> print match.group(1)
ரூ.
>>> print match.group(2)
100

Or, you can unpack matched groups into variables, like this:

>>> s = u"100ஆம்"
>>> match = re.match(pattern, s, re.UNICODE)
>>> before, digits, after = match.groups()
>>> print before

>>> print digits
100
>>> print after
ஆம்

Hope that helps.
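
The session above is Python 2 (ur'' literals and print statements). A minimal sketch of the same pattern for Python 3, where str patterns are Unicode-aware by default, might look like this; the helper name split_on_digits is made up for this example.

```python
import re

# Same three groups as above: text before the digits, the digits, the rest.
DIGIT_SPLIT = re.compile(r"(.*?)(\d+)(.*)")


def split_on_digits(word):
    """Return (before, digits, after), or None if the word has no digits."""
    match = DIGIT_SPLIT.match(word)
    return match.groups() if match else None


for word in ["ரூ.100", "ரூ100", "100ஆம்", "ஆம்"]:
    print(word, split_on_digits(word))
```

Because the first group is lazy and the last one is greedy, only the first run of digits is split out; anything after it, including later digits, ends up in the third group.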

edited Mar 30 '14 at 07:31

answered Mar 30 '14 at 07:25

alecxe


I tried the first pattern match you suggested and it works. Thank you. I'll try the other one too. – charvi Mar 30 '14 at 07:39
Thank you very much! The second one you suggested works too! – charvi Mar 30 '14 at 07:55

score 1 · Answer 2 · answered Mar 30 '14 at 11:06

1

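Use Unicode properties:

\pL stands for a letter in any language
\pN stands for a numeric character in any language (\p{Nd} restricts this to decimal digits).

Note that Python's built-in re module does not support \p classes; the third-party regex module does, using the \p{L} form. Tamil vowel signs such as ூ are combining marks (\p{M}) rather than letters, so they have to be allowed alongside the letters.

In your case it could be:

([\p{L}\p{M}]+\.?)(\p{N}+)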
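
Since the standard library's re module has no \p classes, the same idea can be approximated with unicodedata.category. The following is a minimal Python 3 sketch, not the answer's own method: the function names are hypothetical, and grouping combining marks (category M) with letters is a choice made here so that Tamil vowel signs such as ூ stay attached to their base letter.

```python
import unicodedata
from itertools import groupby


def char_kind(ch):
    """Classify one code point by its Unicode general category."""
    category = unicodedata.category(ch)
    if category == "Nd":           # decimal digits
        return "digit"
    if category[0] in ("L", "M"):  # letters plus combining marks
        return "letter"
    return "other"


def split_runs(word):
    """Group consecutive code points of the same kind into (kind, text) pairs."""
    return [(kind, "".join(chars)) for kind, chars in groupby(word, key=char_kind)]


print(split_runs("ரூ.100"))
print(split_runs("100ஆம்"))
```

For ரூ.100 this yields a letter run, the full stop as its own "other" run, and a digit run, so the caller can decide whether to keep the full stop with the prefix.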

answered Mar 30 '14 at 11:06

Toto

