Do you already know which language you are working with?
If yes, Unicode Blocks 1 can help you find the code point range of that language's alphabet.
If not, Unicode Blocks 2 can give you an idea of which range your text falls in. Either way, you can then use that range to define a regex character class that finds every character specific to that writing system.
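If you are not sure which blocks your text falls in, a quick way to get a hint is to look at the Unicode character names. This is only a rough sketch: guess_scripts is a name I made up, and taking the first word of the character name is a heuristic, not the official Unicode Script property.

```python
import unicodedata
from collections import Counter

def guess_scripts(text):
    """Count letters by the first word of their Unicode name (heuristic)."""
    counts = Counter()
    for ch in text:
        # Only letters are counted; combining marks, spaces and punctuation are skipped.
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

counts = guess_scripts('आयु, hello लिंग')
print(counts.most_common())
```

Once you know the script name, you can look up its block in the Unicode charts and copy the range into your regex.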
I don't know whether your data is in a file or already stored in Python, so I will leave out the loop that matches each line, but the following regex should let you get the desired content:
import re

# Python 3: str literals are Unicode, so the \u escapes work without any extra decoding.
# This is the Basic Latin range; for multiple ranges use e.g. '[\u0020-\u007F\u00A0-\u00FF]'
regex = '[\u0020-\u007F]'
reg_compiled = re.compile(regex)
# item stands for your string/line/variable; decode it only if it is still bytes
de_item = item.decode('utf-8') if isinstance(item, bytes) else item
if reg_compiled.search(de_item):
    print(item)  # or return item, if this lives inside a function
I know this is pretty verbose, but I prefer the code to be very clear so that whoever reads it understands it immediately.
It's up to you to decide what item is, but in case you have:
आयु, hello लिंग
as an item, you get the whole string back, because it contains at least one character in the Basic Latin range.
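If you want only the matching characters instead of the whole line, findall is one option. The snippet below is a sketch under my own assumptions: the names LATIN_LETTERS, DEVANAGARI, contains and extract are made up for illustration, and I assume the Devanagari block (U+0900 to U+097F) is the script you are after. Note that the Basic Latin range U+0020 to U+007F also covers the space and punctuation, so a line with no Latin letters at all can still match it.

```python
import re

# Assumed ranges for illustration: ASCII letters only, and the Devanagari block.
LATIN_LETTERS = re.compile(r'[A-Za-z]+')
DEVANAGARI = re.compile('[\u0900-\u097F]+')
BASIC_LATIN = re.compile('[\u0020-\u007F]')

def contains(pattern, text):
    """True if at least one character of text falls in the pattern's range."""
    return pattern.search(text) is not None

def extract(pattern, text):
    """Return every run of consecutive characters that falls in the range."""
    return pattern.findall(text)

item = 'आयु, hello लिंग'
print(contains(LATIN_LETTERS, item))   # the line contains Latin letters
print(extract(DEVANAGARI, item))       # only the Devanagari runs
print(extract(LATIN_LETTERS, item))    # only the Latin runs

# The space alone is enough to match the full Basic Latin range.
print(contains(BASIC_LATIN, 'आयु लिंग'))
```

Narrowing the class to letters only (as in LATIN_LETTERS) avoids false positives from spaces and commas, at the cost of ignoring digits and punctuation.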