How to extract matching strings into a defaultdict(set)? Python

Question

I have a text file with lines like the one below, where an English sentence is followed by a Spanish sentence and the corresponding translation table, all delimited by "{##}". (If you know it, it's the output of giza-pp.)

you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22

The translation table is read as follows: 0-0 0-1 means that the 0th word in English (i.e. you) matches the 0th and 1st words in Spanish (i.e. sus señorías).
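To make the format concrete, here is a minimal sketch (Python 3) that turns the third field into integer pairs and prints the aligned words. The helper name `parse_alignment` is hypothetical, and the sample is a truncated version of the line above, kept short for illustration only.

```python
def parse_alignment(trans):
    """Turn '0-0 0-1 1-2' into [(0, 0), (0, 1), (1, 2)]."""
    pairs = []
    for pair in trans.split():
        src_idx, trg_idx = pair.split("-")
        pairs.append((int(src_idx), int(trg_idx)))
    return pairs


# Truncated version of the sample line (first five English words only).
line = ("you have requested a debate {##} "
        "sus señorías han solicitado un debate {##} "
        "0-0 0-1 1-2 2-3 3-4 4-5")

eng, spa, trans = line.split(" {##} ")
eng_words, spa_words = eng.split(" "), spa.split(" ")

# Print each aligned (English, Spanish) word pair.
for e, s in parse_alignment(trans):
    print(eng_words[e], "->", spa_words[s])
```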

Let's say I want to know the Spanish translation of course in this sentence. Normally I'd do it this way:

from collections import defaultdict
eng, spa, trans =  x.split(" {##} ")
tt = defaultdict(set)
for s,t in [i.split("-") for i in trans.split(" ")]:
  tt[s].add(t)

query = 'course'
for i in spa.split(" ")[tt[eng.index(query)]]:
  print i

Is there a simpler way to do the above? Maybe regex? line.find()?

After some tries, I had to do this in order to cover other issues like MWEs and missing translations:

def getTranslation(gizaline,query):
    src, trg, trans =  gizaline.split(" {##} ")
    tt = defaultdict(set)
    for s,t in [i.split("-") for i in trans.split(" ")]:
        tt[int(s)].add(int(t))
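    src_words = src.split(" ")
    trg_words = trg.split(" ")
    if query not in src_words:
        # Fall back to a hyphenated token containing the query, e.g. "part-session".
        for w in src_words:
            # Each test needs its own "in"; `"-"+query or ...` is always true.
            if "-" + query in w or query + "-" in w:
                query = w
                break
        else:
            return "#NULL"  # query does not occur in the source sentence
    query_translated = [trg_words[i] for i in tt[src_words.index(query)]]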

    if len(query_translated) > 0:
        return ":".join(query_translated)
    else:
        return "#NULL"

thanks for the note on the mistake. possibly someone might have a simpler way or at least faster way=) — alvas, Feb 28 '13 at 10:07
actually a `defaultdict(list)` would do fine too, just want a set so that i dont get duplicates. — alvas, Feb 28 '13 at 10:48
I'd use a `list` simply because it allows you to keep order, a `set` would not do that. As for duplicates, they can be beneficial - they show you the *order* that the words appear in spanish, even accounting for grammatical changes. If you don't allow duplicates, you'll only get one occurrence of each translation, and not ordered - meaning you don't know which translation is correct. — TyrantWave, Feb 28 '13 at 10:50
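One way to reconcile the two comments above, keeping first-seen order while still dropping duplicates, is to append to a list only when the value is new. A minimal sketch (Python 3); `add_unique` is a hypothetical helper and the alignment string is made up for illustration:

```python
from collections import defaultdict


def add_unique(table, key, value):
    """Append value to table[key] unless it is already there."""
    if value not in table[key]:
        table[key].append(value)


tt = defaultdict(list)
# Made-up alignment with a repeated pair (0-1), for illustration only.
for pair in "0-1 0-0 0-1 1-2".split(" "):
    s, t = (int(n) for n in pair.split("-"))
    add_unique(tt, s, t)

print(dict(tt))  # {0: [1, 0], 1: [2]}
```

The membership test is linear in the list length, which should be fine for the handful of alignments a single word has.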

score 2 · Accepted Answer · answered Feb 28 '13 at 10:53

That way works fine, but I'd do it slightly differently, using a list instead of a set so the words stay in the order they appear in the translation table (a set has no guaranteed order, which is not quite what we want):

File: q_15125575.py

#-*- encoding: utf8 -*-
from collections import defaultdict

INPUT = """you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"""

if __name__ == "__main__":
    english, spanish, trans = INPUT.split(" {##} ")
    eng_words = english.split(' ')
    spa_words = spanish.split(' ')
    transtable = defaultdict(list)
    for e, s in [i.split('-') for i in trans.split(' ')]:
        transtable[eng_words[int(e)]].append(spa_words[int(s)])

    print(transtable['course'])
    print(transtable['you'])
    print(" ".join(transtable['course']))
    print(" ".join(transtable['you']))

Output:
['curso']
['sus', 'se\xc3\xb1or\xc3\xadas']
curso
sus señorías

It's slightly longer code as I'm using the actual words instead of the indexes, but this allows you to look words up directly in transtable.

However, your method and mine both fail on the same issue: repeated words.
print(" ".join(transtable['this']))
gives:
el este
It's at least in the order that the words appear though, so it's workable. Want the first occurrence of 'this' translated?
transtable['this'][0] would give you the first word.
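If the translations of each occurrence need to stay separate, one option is to key the table by source position rather than by word. The following is a sketch under that assumption (Python 3), not part of the original answer; `translations_by_occurrence` is a hypothetical name.

```python
from collections import defaultdict

LINE = (
    "you have requested a debate on this subject in the course of the next "
    "few days , during this part-session . {##} "
    "sus señorías han solicitado un debate sobre el tema para los próximos "
    "días , en el curso de este período de sesiones . {##} "
    "0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 "
    "17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"
)


def translations_by_occurrence(gizaline, query):
    """Return one list of target words per occurrence of query, in sentence order."""
    src, trg, trans = gizaline.split(" {##} ")
    src_words, trg_words = src.split(" "), trg.split(" ")

    # Key the table by source *position*, so repeated words stay separate.
    by_position = defaultdict(list)
    for pair in trans.split(" "):
        s, t = pair.split("-")
        by_position[int(s)].append(trg_words[int(t)])

    # .get avoids creating empty entries for unaligned positions.
    return [by_position.get(i, []) for i, w in enumerate(src_words) if w == query]


print(translations_by_occurrence(LINE, "this"))  # [['el'], ['este']]
```

Picking a particular occurrence is then plain indexing on the result, and a word that is absent from the sentence yields an empty list.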

And using your code:

tt = defaultdict(set)
for e, s in [i.split('-') for i in trans.split(' ')]:
    tt[int(e)].add(int(s))

query = 'this'
for i in tt[eng_words.index(query)]:
    print(i)

Gives:
7

Your code only looks at the first occurrence of a word, so it prints just the Spanish index aligned to that occurrence.

Your code in the question as is won't even work though; it has quite a few errors. You're adding strings to tt (`tt[s].add(t)` would put `'1'` into `tt['0']`, for example). Then when you get the index, you're grabbing it from the raw string `eng`, not the split words. This index is a number, not a string, so the lookup will always return nothing. If you change it to `tt[int(s)].add(int(t))`, the next problem is that it would fail because indexing `spa.split(" ")` expects an integer, not a set. And lastly (assuming `eng` is the split words), `eng.index(query)` will still only return the first result. — TyrantWave, Feb 28 '13 at 11:46
thanks for noting the missing codes, i'll put my full code up. =) — alvas, Feb 28 '13 at 12:01
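Two of the pitfalls named in the comment above are easy to reproduce in isolation. A small sketch (Python 3), with variable names and sample strings chosen purely for illustration:

```python
from collections import defaultdict

eng = "you have requested a debate"
eng_words = eng.split(" ")

# str.index returns a character offset, list.index a word position.
print(eng.index("debate"))        # 21
print(eng_words.index("debate"))  # 4

# Keys stored as strings are not found by an integer lookup.
tt = defaultdict(set)
for s, t in (p.split("-") for p in "0-0 0-1 4-5".split(" ")):
    tt[s].add(t)

print(4 in tt, "4" in tt)  # False True
```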
