How to parse non-English mixed text in Python

Question

I have the following random data, generated by parsing an image: https://dpaste.de/wwuj/raw

I want to generate a CSV file and need to extract the following fields from the text:

नाम, पति का नाम, मकान संख्या, आयु, लिंग

Questions:

Can we use regex to parse non-English characters in Python?
It would be good if you could show a small demo of how to get the field values.

Thanks.

Regex works for your case; I tested it. `re.findall(r'नाम','नाम, पति का नाम, मकान संख्या, आयु, लिंग')` returned `['नाम', 'नाम']` — Space Impact, Sep 06 '18 at 13:09
There are multiple of them, and I have to create one row for each `नाम, पति का नाम, मकान संख्या, आयु, लिंग` set. `re.findall` with `r'नाम'` returns all names regardless of line, so grouping is not possible. — Elon Musk, Sep 06 '18 at 13:11
@ElonMusk Please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), which helps us provide a solution. — Space Impact, Sep 06 '18 at 13:13

score 0 · Accepted Answer · answered Sep 06 '18 at 13:22

Do you already know which language you are working with? If so, Unicode Blocks can give you the code point range of that language's script. If not, Unicode Blocks can still help you work out which range your characters fall in. Either way, you can then define a regex character range to find every character specific to that script.
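As an illustration of the range idea for the Hindi text in the question, here is a minimal sketch. It assumes the text is in the Devanagari block (U+0900 to U+097F) and that you are on Python 3, where `str` is already Unicode. The names `DEVANAGARI_RUN`, `devanagari_runs` and `sample` are made up for this example.

```python
import re

# Devanagari block, U+0900 to U+097F. Swap in another block's range as needed.
DEVANAGARI_RUN = re.compile(r'[\u0900-\u097F]+')

def devanagari_runs(text):
    """Return every maximal run of Devanagari characters found in text."""
    return DEVANAGARI_RUN.findall(text)

sample = 'आयु, hello लिंग 42'  # hypothetical mixed line
runs = devanagari_runs(sample)
print(runs)  # ['आयु', 'लिंग']
```

The Latin word, the digits and the punctuation fall outside the range, so only the Devanagari words come back.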

I don't know whether you have a file or the data is already stored in Python, so I will leave out the loop that matches each line, but the following regex should allow you to get the desired content:

import re

# Basic Latin range. For several ranges use e.g. '[\u0020-\u007F\u00A0-\u00FF]'.
regex = '[\u0020-\u007F]'  # Python 3 str literals are Unicode, no decode needed
reg_compiled = re.compile(regex)

def match_item(item):
    # item stands for your string/line/variable; decode it first if it is bytes
    de_item = item.decode('utf-8') if isinstance(item, bytes) else item
    if reg_compiled.search(de_item):
        return item  # or print(item)

I know this is all pretty verbose, but I prefer the code to be very clear so that whoever reads it understands it immediately.

It's up to you to decide what `item` is, but if you have:

आयु, hello लिंग

as an item, it will return the whole string, because the item contains Basic Latin characters.
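To get one CSV row per record, as asked in the comments, something along these lines might work. This is only a sketch under a big assumption: the linked data is not reproduced here, so the code assumes each record sits on one line as `label : value` pairs. The sample line, `FIELDS`, `LABEL`, `parse_record` and `to_csv` are all hypothetical and would need adapting to the real layout.

```python
import csv
import io
import re

FIELDS = ['नाम', 'पति का नाम', 'मकान संख्या', 'आयु', 'लिंग']

# A label followed by a colon. The scan is leftmost-first, so 'पति का नाम'
# is found at its own start before the 'नाम' inside it can match.
LABEL = re.compile('(' + '|'.join(re.escape(f) for f in FIELDS) + r')\s*:\s*')

def parse_record(line):
    """Map each label on the line to the text that follows it."""
    # split() with one capture group yields [prefix, label, value, label, value, ...]
    parts = LABEL.split(line)
    return {label: value.strip() for label, value in zip(parts[1::2], parts[2::2])}

def to_csv(lines):
    """Write one CSV row per record line and return the CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for line in lines:
        writer.writerow(parse_record(line))  # missing fields become ''
    return out.getvalue()

# Hypothetical record, NOT taken from the linked data
sample = 'नाम : सीता पति का नाम : राम मकान संख्या : 12 आयु : 34 लिंग : महिला'
print(to_csv([sample]))
```

Parsing line by line keeps the fields of one record together, which a single `re.findall` over the whole text cannot do.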
