I want to check whether a word contains both digits and other characters and, if so, separate the digit part from the character part. I am working with Tamil words, e.g. ரூ.100 or ரூ100: I want to separate ரூ. and 100, and ரூ and 100. How do I do this in Python? I tried this:

    for word in f.read().strip().split():
        for word1, word2, word3 in zip(word, word[1:], word[2:]):
            if word1 == "ர" and word2 == "ூ " and word3.isdigit():
                print word1
                print word2
            if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2):
                print word1
                print word2
charvi
  • Have you tried anything? – Ismail Badawi Mar 30 '14 at 07:18
  • I tried checking whether the first character is ரூ and whether it is followed by a digit, but I could not match against the Unicode value; it throws an error. – charvi Mar 30 '14 at 07:21
  • this is what i tried: for word in f.read().strip().split(): for word1, word2, word3 in zip(word,word[1:],word[2:]): if word1 == "ர" and word2 == "ூ " : #and word3.isdigit(): print word1 print word2 if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2): print word1 print word2 – charvi Mar 30 '14 at 07:22
  • @Ismail Badawi But I also want to separate 100 and ஆம் in words like 100ஆம், so I thought the above code would not be generic anyway and dropped it. – charvi Mar 30 '14 at 07:25
  • @charvi: post your code in your question. With formatting. – smci Mar 30 '14 at 10:03
  • @smci: From next time I will post my code in the question. Thank you. – charvi Mar 31 '14 at 16:38

2 Answers

You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the first run of digits, the digits themselves, and everything after them:

>>> import re
>>> pattern = ur'(.*?)(\d+)(.*)'
>>> s = u"ரூ.100"
>>> match = re.match(pattern, s, re.UNICODE)
>>> print match.group(1)
ரூ.
>>> print match.group(2)
100

Or, you can unpack matched groups into variables, like this:

>>> s = u"100ஆம்"
>>> match = re.match(pattern, s, re.UNICODE)
>>> before, digits, after = match.groups()
>>> print before

>>> print digits
100
>>> print after
ஆம்
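
The sessions above are Python 2. As a rough sketch of the same pattern in Python 3 (an assumption about the reader's setup: there, str patterns are Unicode-aware by default, so neither the ur prefix nor re.UNICODE is needed), the match can be wrapped in a small helper. The name split_around_digits is hypothetical, and returning None for words without digits is just one possible choice:

```python
import re

# Lazy prefix, the first run of digits, then the rest of the word.
PATTERN = re.compile(r"(.*?)(\d+)(.*)")


def split_around_digits(word):
    """Return (before, digits, after), or None if the word has no digits."""
    match = PATTERN.match(word)
    if match is None:
        # Calling .groups() on a failed match would raise AttributeError.
        return None
    return match.groups()


for word in ["ரூ.100", "ரூ100", "100ஆம்", "ரூ"]:
    print(repr(word), split_around_digits(word))
```

Checking for None matters when looping over a whole file, since most words will not contain digits at all.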

Hope that helps.

alecxe

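Use Unicode properties:

\pL stands for a letter in any language
\pN stands for a numeric character (digits included) in any language
\pM stands for a combining mark, such as the Tamil vowel sign ூ

Note that Python's built-in re module does not support \p escapes; the third-party regex module does. Also, ூ in ரூ is a combining mark rather than a letter, so \pL+ on its own stops after ர and the marks have to be allowed as well.

In your case it could be:

([\pL\pM]+\.?)(\pN+)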
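
Since the built-in re module has no \p escapes, here is a minimal standard-library sketch of the same idea in Python 3. It is a swapped-in technique, not the regex above: it groups characters by Unicode general category with unicodedata, treating "Nd" (decimal digit) as the digit class and everything else, including combining vowel signs and the dot, as the non-digit part. The function names are hypothetical:

```python
import unicodedata
from itertools import groupby


def is_decimal_digit(ch):
    # "Nd" is the Unicode general category for decimal digits in any script.
    return unicodedata.category(ch) == "Nd"


def split_digit_runs(word):
    """Split a word into alternating runs of digit and non-digit characters."""
    return ["".join(run) for _, run in groupby(word, key=is_decimal_digit)]


for word in ["ரூ.100", "ரூ100", "100ஆம்"]:
    print(split_digit_runs(word))
```

Unlike the two-group pattern, this returns however many runs the word contains, so it also covers words where digits appear in the middle.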
Toto