I have a TMX file containing source and target segments. Some of these segments are made up of several sentences. My goal is to segment these multi-sentence segments so that the entire TMX file consists of single-sentence segments.
I intend to use spaCy's dependency parser to split these multi-sentence segments into sentences.
To achieve this, I have extracted the source and target segments using the Translate Toolkit package.
I then added the source and target segments to a dictionary (seg_dic). Next, I converted these segments into spaCy Doc objects and stored them in a second dictionary (doc_dic). I now want to split any multi-sentence segments using spaCy's dependency parser ...
for sent in doc.sents:
    print(sent.text)
... but I don't know how to do this when the segments are stored in a dictionary.
This is what I have so far:
import spacy
from translate.storage.tmx import tmxfile

# Load the TMX file (source language: de-DE, target language: en-GB)
with open("./files/NTA_test.tmx", 'rb') as fin:
    tmx_file = tmxfile(fin, 'de-DE', 'en-GB')

nlp_de = spacy.load("de_core_news_lg")
nlp_en = spacy.load("en_core_web_lg")

seg_dic = {}  # source text -> target text
doc_dic = {}  # source Doc -> target Doc

for node in tmx_file.unit_iter():
    seg_dic[node.source] = node.target

for source_seg, target_seg in seg_dic.items():
    doc_dic[nlp_de(source_seg)] = nlp_en(target_seg)
Can anyone explain how I can proceed from here? How can I iterate over my dictionary keys and values using the "for sent in doc.sents" logic?
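For reference, here is a minimal sketch of the kind of loop I have in mind, not a definitive solution. It assumes every key and value is a spaCy Doc (or any object whose `.sents` yields items with a `.text` attribute). The function name `split_segment_pairs` and the `same_count` flag are hypothetical names made up for illustration.

```python
from typing import Iterable, List, Tuple


def split_segment_pairs(doc_pairs: Iterable[Tuple[object, object]]) -> List[dict]:
    """Split each (source_doc, target_doc) pair into lists of sentence strings.

    Each element is expected to be a spaCy Doc, or anything else that
    exposes `.sents` yielding objects with a `.text` attribute.
    """
    results = []
    for source_doc, target_doc in doc_pairs:
        # Same "for sent in doc.sents" logic, applied to both sides of the pair
        source_sents = [sent.text.strip() for sent in source_doc.sents]
        target_sents = [sent.text.strip() for sent in target_doc.sents]
        results.append({
            "source": source_sents,
            "target": target_sents,
            # Hypothetical flag: equal counts only suggest that a 1:1 pairing
            # is plausible; they do not prove the sentences align.
            "same_count": len(source_sents) == len(target_sents),
        })
    return results
```

With the dictionary above, this would be called as `split_segment_pairs(doc_dic.items())`. Since `.items()` just yields (key, value) tuples, a plain list of (source, target) pairs would work equally well, and it would also keep units whose source text repeats, which `seg_dic` silently overwrites.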