How to import and analyse a large TMX file in Python

Asked Dec 01 '19 at 22:21

Active Dec 01 '19 at 22:52

My ultimate goal is to write a script that returns a few pairs of sentences from a TMX (Translation Memory eXchange) file. The file is from http://opus.nlpl.eu/OpenSubtitles2018.php and is about 2.1 GB.

I have tried reading it with the tmxfile class from the Translate Toolkit:

from translate.storage.tmx import tmxfile
with open("da-en.tmx", 'rb') as fin:
    tmx_file = tmxfile(fin, 'da', 'en')

but it never seems to finish loading; it just waits endlessly. I also tried a program called Stingray, but it crashes as soon as I import the TMX file.

What is the best strategy to achieve this goal? I don't mind using AWK, grep, or other dedicated text-parsing tools.

edited Dec 01 '19 at 22:52

Ed Morton

asked Dec 01 '19 at 22:21

Areza

It can take a bit of time to load a large file. Have you tried waiting while it is loading? – user1558604 Dec 01 '19 at 22:23
Yes, indeed. I am giving up. Any tips on how to do it in batches, for example? Or on converting it to JSON or CSV? – Areza Dec 01 '19 at 22:28
It depends on what you need from the TMX file. Clean segments? More than that? Try BeautifulSoup to parse all TU nodes. If it works well and fast enough, there might be some issue with the translate toolkit library. – Wiktor Stribiżew Dec 05 '19 at 16:56
Use BeautifulSoup with the `lxml` parser. – Wiktor Stribiżew Feb 11 '20 at 20:51
@WiktorStribiżew - I have clearly written that I need to extract a few pairs of sentences! Can you give an example of how to use BeautifulSoup in this case? Thanks. – Areza Feb 29 '20 at 20:30
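
No example was posted in the thread. As a rough sketch only, and swapping the suggested BeautifulSoup for the standard library's streaming parser `xml.etree.ElementTree.iterparse` (an in-memory tree of a 2.1 GB file is likely to be impractical), something like the following reads one `<tu>` at a time and stops as soon as enough pairs have been collected. The helper name `iter_pairs` and the inline `SAMPLE` document are made up for illustration, and the code assumes the usual TMX layout (`tu` > `tuv` with an `xml:lang` attribute > `seg`) with no default XML namespace, which has not been checked against the OPUS file.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src_lang="da", tgt_lang="en"):
    """Yield (source, target) segment pairs, parsing one <tu> at a time.

    `source` may be a filename or a binary file object. Hypothetical helper;
    assumes un-namespaced <tu>/<tuv>/<seg> tags.
    """
    body = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "body":
                body = elem  # remember the container so it can be emptied later
            continue
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # Older TMX versions use a plain "lang" attribute instead.
                lang = tuv.get(XML_LANG) or tuv.get("lang")
                segs[lang] = "".join(seg.itertext()).strip()
        # Drop processed units so memory use stays roughly constant.
        elem.clear()
        if body is not None:
            body.clear()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


# Tiny made-up document standing in for the real file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
<header srclang="da"/>
<body>
<tu><tuv xml:lang="da"><seg>Hej.</seg></tuv><tuv xml:lang="en"><seg>Hello.</seg></tuv></tu>
<tu><tuv xml:lang="da"><seg>Tak.</seg></tuv><tuv xml:lang="en"><seg>Thanks.</seg></tuv></tu>
<tu><tuv xml:lang="da"><seg>Farvel.</seg></tuv><tuv xml:lang="en"><seg>Goodbye.</seg></tuv></tu>
</body>
</tmx>
"""

# Take only the first two pairs; the rest of the input is never processed.
pairs = list(itertools.islice(iter_pairs(io.BytesIO(SAMPLE)), 2))
for da, en in pairs:
    print(da, "|", en)
```

For the real file, passing the path instead of the in-memory sample, e.g. `itertools.islice(iter_pairs("da-en.tmx"), 5)`, should behave the same way. Because the generator is lazy, asking for a few pairs only reads the beginning of the file.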