I have a TMX file containing source and target segments. Some of these segments are made up of several sentences. My goal is to segment these multi-sentence segments so that the entire TMX file consists of single-sentence segments.

I intend to use spaCy's dependency parser to split these multi-sentence segments.

To achieve this, I have extracted the source and target segments using the Translate Toolkit package.

I then added the source and target segments to a dictionary (seg_dic). Next, I converted these segments into spaCy Doc objects and stored them in a second dictionary (doc_dic). I now want to split any multi-sentence segments using spaCy's dependency parser ...

for sent in doc.sents:
    print(sent.text)

... but I don't know how I can do this with the segments being stored in a dictionary.

This is what I have so far:

import spacy
from translate.storage.tmx import tmxfile

with open("./files/NTA_test.tmx", 'rb') as fin:
    tmx_file = tmxfile(fin, 'de-DE', 'en-GB')

nlp_de = spacy.load("de_core_news_lg")
nlp_en = spacy.load("en_core_web_lg")

seg_dic = {}
doc_dic = {}

for node in tmx_file.unit_iter():
    seg_dic[node.source] = node.target
for source_seg, target_seg in seg_dic.items():
    doc_dic[nlp_de(source_seg)] = nlp_en(target_seg)

Can anyone explain how I can proceed from here? How can I iterate over my dictionary keys and values using the "for sent in doc.sents" logic?

f5kdm85

1 Answer

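The solution here is not to put your segments in a dictionary like that: a dictionary keyed on the source text silently collapses units that share the same source, and it buys you nothing for iteration. Use a list instead. Maybe something like this.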

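import spacy
from translate.storage.tmx import tmxfile

# Parse the TMX file; the units are read into memory here.
with open("./files/NTA_test.tmx", 'rb') as fin:
    tmx_file = tmxfile(fin, 'de-DE', 'en-GB')

de = spacy.load("de_core_news_lg")
en = spacy.load("en_core_web_lg")

# List of (German sentence, English sentence) pairs, in file order.
out = []

for node in tmx_file.unit_iter():
    # Split each side into sentences with the matching language model.
    de_sents = list(de(node.source).sents)
    en_sents = list(en(node.target).sents)
    # Stop as soon as the two sides disagree on the sentence count.
    assert len(de_sents) == len(en_sents), "Different number of sentences!"

    # Pair sentences by position.
    for desent, ensent in zip(de_sents, en_sents):
        out.append((desent, ensent))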

The hard part will be deciding what to do when the sentence counts on the two sides don't match. I would also be cautious about this conversion in the first place: a translator may have translated a segment holistically, so even if the counts match there's no guarantee that the first DE sentence corresponds to the first EN sentence, for example.
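
One conservative way to handle a mismatch, instead of the assert, is to keep the original segment pair unsplit whenever the counts differ. The sketch below is only an illustration of that fallback: `split_pair` and `naive_split` are hypothetical names, and `naive_split` is a crude regex stand-in so the example runs without the spaCy models. With spaCy you would pass something like `lambda text: [s.text.strip() for s in de(text).sents]` as the splitter instead.

```python
import re
from typing import Callable, List, Tuple

Splitter = Callable[[str], List[str]]


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    This is NOT how spaCy segments text; it only keeps the example
    runnable without any models installed.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_pair(
    source: str,
    target: str,
    split_source: Splitter,
    split_target: Splitter,
) -> List[Tuple[str, str]]:
    """Return sentence-level pairs, or the unsplit pair if the counts differ."""
    src_sents = split_source(source)
    tgt_sents = split_target(target)
    if len(src_sents) != len(tgt_sents):
        # Fallback (an assumption of this sketch): leave the segment as it was.
        return [(source, target)]
    # Pair by position; equal counts do not prove the sentences correspond.
    return list(zip(src_sents, tgt_sents))


matched = split_pair(
    "Hallo. Wie geht es dir?", "Hello. How are you?", naive_split, naive_split
)
mismatched = split_pair(
    "Hallo. Wie geht es dir?", "Hello, how are you?", naive_split, naive_split
)
print(matched)
print(mismatched)
```

Keeping the unsplit pair loses nothing from the original file, and the fallback cases can be collected separately for manual review. It does not address the second caveat: pairs with equal counts are still aligned purely by position.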

polm23
  • Thank you so much, this is really helpful! – f5kdm85 Jun 23 '21 at 20:10
  • Glad I could help. For helpful answers, you can vote them up by clicking on the up arrow by their top left, and accept them by clicking the check mark. – polm23 Jun 24 '21 at 04:26
  • Thanks, I’ve accepted your answer but don’t seem to have enough points to upvote your reply. – f5kdm85 Jun 25 '21 at 05:39