
My ultimate goal is to write a script that returns a few pairs of sentences from a TMX (Translation Memory eXchange) file. The file is from http://opus.nlpl.eu/OpenSubtitles2018.php and is about 2.1 GB.

I have tried reading it with the tmxfile module from the Translate Toolkit:

from translate.storage.tmx import tmxfile
with open("da-en.tmx", 'rb') as fin:
    tmx_file = tmxfile(fin, 'da', 'en')

but it never seems to finish loading; it just waits endlessly. I also tried a program called Stingray, but it crashes as soon as I import the TMX file.

What is the best strategy to achieve this goal? I don't mind using AWK, grep, or other dedicated text-processing tools.

Areza
  • It can take a bit of time to load a large file. Have you tried waiting while it is loading? – user1558604 Dec 01 '19 at 22:23
  • Yes, indeed, and I am giving up. Any tips on how to process it in batches, for example? Or on converting it to JSON or CSV? – Areza Dec 01 '19 at 22:28
  • It depends on what you need from the TMX file. Clean segments? More than that? Try BeautifulSoup to parse all TU nodes. If it works well and fast enough, there might be some issue with the translate toolkit library. – Wiktor Stribiżew Dec 05 '19 at 16:56
  • Use BeautifulSoup with the `lxml` parser. – Wiktor Stribiżew Feb 11 '20 at 20:51
  • @WiktorStribiżew - I clearly wrote that I need to extract a few pairs of sentences! Can you give an example of how to use BeautifulSoup in this case? Thanks. – Areza Feb 29 '20 at 20:30
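
For a file this size, one possible approach is to stream it rather than load it whole. The sketch below is not the BeautifulSoup route suggested in the comments: BeautifulSoup builds the entire tree in memory, so this swaps in the standard library's `xml.etree.ElementTree.iterparse`, which hands back one element at a time. The helper names `iter_pairs` and `first_pairs` are hypothetical, and the code assumes the file uses un-namespaced `<tu>`, `<tuv>` and `<seg>` elements with the language given in `xml:lang` (or a plain `lang` attribute) as exactly `da` and `en`; adjust these if the actual file differs.

```python
import itertools
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src_lang="da", tgt_lang="en"):
    """Yield (source_text, target_text) for each <tu>, parsing incrementally."""
    body = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem  # remember the container so it can be emptied later
            continue
        if elem.tag != "tu":
            continue
        # A complete translation unit has been read; collect its segments.
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        # Discard units that are already processed so memory use stays flat.
        if body is not None:
            body.clear()
        else:
            elem.clear()


def first_pairs(path, n, src_lang="da", tgt_lang="en"):
    """Return the first n sentence pairs without reading the rest of the file."""
    with open(path, "rb") as fh:
        return list(itertools.islice(iter_pairs(fh, src_lang, tgt_lang), n))
```

Calling `first_pairs("da-en.tmx", 5)` should stop reading as soon as five pairs have been found. Because `iter_pairs` is a generator, the same loop could also write each pair to a CSV or JSON Lines file for batch processing instead of collecting them in a list.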
