
I have the following random data generated by parsing an image - https://dpaste.de/wwuj/raw

I want to generate a CSV file and need to extract the following fields from the text:

नाम, पति का नाम, मकान संख्या, आयु, लिंग (name, husband's name, house number, age, gender)

Questions:

  1. Can we use regex to parse non-English characters in Python?

  2. It would be good if you could show a small demo of how to extract the field values.

Thanks.

Elon Musk
  • Are you using Python 3? If so, yes, Unicode is supported. – Olivier Melançon Sep 06 '18 at 13:06
  • Regex works for your case and tested. Checked on `re.findall(r'नाम','नाम, पति का नाम, मकान संख्या, आयु, लिंग')` which returned `['नाम', 'नाम']` – Space Impact Sep 06 '18 at 13:09
  • There are multiple of them; I have to create one row for each `नाम, पति का नाम, मकान संख्या, आयु, लिंग`. `re.findall(r'नाम')` returns all names regardless of line, so grouping is not possible. – Elon Musk Sep 06 '18 at 13:11
  • @ElonMusk Please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), which helps us to provide a solution. – Space Impact Sep 06 '18 at 13:13

1 Answer


Do you already know which script you are working with? If so, a Unicode Blocks chart gives you the code point range of that script's alphabet. If not, a Unicode Blocks chart can help you work out which range your text falls in. Either way, you can then use that range in a regex character class to find every character specific to that script.
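As an illustration of turning a block range into a character class, here is a minimal Python 3 sketch. It assumes the text is Hindi, so it uses the Devanagari block (U+0900 to U+097F); the function names are made up for this example, and the input is the field list from the question rather than the real parsed data.

```python
import re

# Devanagari block, U+0900 to U+097F (assumed here because the labels are Hindi).
DEVANAGARI_WORD = re.compile(r'[\u0900-\u097F]+')
# Same range plus a plain space, so multi-word labels stay in one match.
DEVANAGARI_PHRASE = re.compile(r'[\u0900-\u097F ]+')

def devanagari_words(text):
    """Return every unbroken run of Devanagari characters in text."""
    return DEVANAGARI_WORD.findall(text)

def devanagari_phrases(text):
    """Return runs of Devanagari words separated only by spaces, trimmed."""
    matches = DEVANAGARI_PHRASE.findall(text)
    return [m.strip() for m in matches if m.strip()]

labels = 'नाम, पति का नाम, मकान संख्या, आयु, लिंग'
print(devanagari_words(labels))    # single words
print(devanagari_phrases(labels))  # whole labels, split at the commas
```

Allowing a space inside the class is one way to keep a label such as `पति का नाम` together; whether that suits the real data depends on how the parsed text separates its fields.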

I don't know whether you have a file or the data is already stored in Python, so I will leave out the loop that matches each line, but the following regex should allow you to get the desired content:

    import re  # Python 3: str is already Unicode, and re understands \uXXXX escapes

    # Basic Latin range; for several ranges use r'[\u0020-\u007F\u00A0-\u00FF]'
    reg_compiled = re.compile(r'[\u0020-\u007F]')

    def check(item):
        # item stands for your string/line/variable; decode it first if it is bytes
        de_item = item.decode('utf-8') if isinstance(item, bytes) else item
        if reg_compiled.search(de_item):
            return item  # or print(item)

I know this is pretty verbose, but I prefer the code to be very clear so that whoever reads it understands it immediately.

It's up to you to decide what item is, but if you have:

आयु, hello लिंग

as an item, it will return the whole string, because the item contains at least one Basic Latin character.
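To make that behaviour concrete, here is a small Python 3 sketch on the same example item. It contrasts `re.search`, which only reports whether a match exists, with `re.findall`, which returns the matched text; the variable names are only for illustration.

```python
import re

item = 'आयु, hello लिंग'

# search() is a yes/no test: a hit means "keep the item", so the whole
# string comes back, Devanagari included.
has_latin = re.search(r'[\u0020-\u007F]', item) is not None

# findall() returns the matched text itself. The Basic Latin range also
# covers the comma and the spaces, so they are part of the run.
latin_runs = re.findall(r'[\u0020-\u007F]+', item)

# Restricting the class to letters leaves only the word.
latin_words = re.findall(r'[A-Za-z]+', item)

print(has_latin, latin_runs, latin_words)
```

Note that the range starts at U+0020, the space character, so any item containing a space or a comma passes the `search` test even when it has no Latin letters at all.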