I have a text file with lines like the one below, where an English sentence is followed by a Spanish sentence and the corresponding translation table, delimited by "{##}". (If you know it, it's the output of giza-pp.)
you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22
The translation table is read as follows: 0-0 0-1 means that the 0th word in English (i.e. you) matches the 0th and 1st words in Spanish (i.e. sus señorías).
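To make that reading concrete, here is a minimal sketch that parses only the two alignment pairs quoted above, together with the words they refer to. The variable names are my own, chosen for illustration.

```python
from collections import defaultdict

# First English word, first two Spanish words, and their alignment pairs,
# all copied from the sample line.
eng_words = ["you"]
spa_words = ["sus", "señorías"]
pairs = "0-0 0-1"

# English word index -> list of aligned Spanish word indices.
table = defaultdict(list)
for pair in pairs.split():
    s, t = pair.split("-")
    table[int(s)].append(int(t))

# Look up the Spanish words aligned to English word 0.
aligned = [spa_words[t] for t in table[0]]
print(eng_words[0], "->", " ".join(aligned))
```

Note that the indices have to be converted to int before they can be used to index the word lists.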
Let's say I want to know the Spanish translation of course in this sentence. Normally I'd do it this way:
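    from collections import defaultdict

    # x is one line of the file, like the sample above
    eng, spa, trans = x.split(" {##} ")
    # English word index -> set of aligned Spanish word indices
    tt = defaultdict(set)
    for s, t in [i.split("-") for i in trans.split(" ")]:
        tt[int(s)].add(int(t))

    query = 'course'
    spa_words = spa.split(" ")
    # index() must run on the word list, not on the raw string
    for i in sorted(tt[eng.split(" ").index(query)]):
        print(spa_words[i])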
Is there a simpler way to do the above? Maybe a regex, or line.find()?
After some tries, I ended up with this in order to cover other issues such as MWEs (multi-word expressions) and missing translations:
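    from collections import defaultdict

    def getTranslation(gizaline, query):
        src, trg, trans = gizaline.strip().split(" {##} ")
        src_words, trg_words = src.split(" "), trg.split(" ")
        # Source word index -> set of aligned target word indices.
        tt = defaultdict(set)
        for s, t in [i.split("-") for i in trans.split(" ")]:
            tt[int(s)].add(int(t))
        try:
            idx = src_words.index(query)
        except ValueError:
            # The query may be one part of a hyphenated token, e.g. "part-session".
            for idx, w in enumerate(src_words):
                # Each side of the "or" needs its own "in" test; the original
                # condition was always true.
                if "-" + query in w or query + "-" in w:
                    break
            else:
                return "#NULL"  # query does not occur in the source sentence
        # Sort so the target words come out in sentence order.
        query_translated = [trg_words[i] for i in sorted(tt[idx])]
        if len(query_translated) > 0:
            return ":".join(query_translated)
        else:
            return "#NULL"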
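
On the regex idea: a minimal sketch, assuming every alignment pair has the form digits-digits, is to pull the pairs out with re.findall instead of splitting twice. The helper name align_table is hypothetical and only for illustration. The line is the sample from above, and this variant only looks up exact tokens.

```python
import re
from collections import defaultdict

def align_table(trans):
    """Hypothetical helper: map source index -> sorted target indices."""
    table = defaultdict(list)
    # Each match is a (source, target) pair of digit strings.
    for s, t in re.findall(r"(\d+)-(\d+)", trans):
        table[int(s)].append(int(t))
    return {s: sorted(ts) for s, ts in table.items()}

line = ("you have requested a debate on this subject in the course of the "
        "next few days , during this part-session . {##} "
        "sus señorías han solicitado un debate sobre el tema para los "
        "próximos días , en el curso de este período de sesiones . {##} "
        "0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 "
        "16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22")

src, trg, trans = line.split(" {##} ")
src_words, trg_words = src.split(" "), trg.split(" ")
table = align_table(trans)

for query in ("you", "course", "part-session"):
    # Unaligned source words have no entry, so fall back to an empty list.
    targets = table.get(src_words.index(query), [])
    print(query, "->", ":".join(trg_words[t] for t in targets) or "#NULL")
```

The regex only replaces the parsing of the table. Finding the query still needs a lookup in the list of source words, since the table refers to word positions rather than character offsets (which is why line.find() on the raw string does not help directly).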