I am translating an XLIFF file using the BeautifulSoup and googletrans packages. I managed to extract all the strings, translate them, and write the translations back by creating a new tag for each one, e.g.

<trans-unit id="100890::53706_004">
<source>Continue in store</source>
<target>Kontynuuj w sklepie</target>
</trans-unit>

The problem appears when the source tag has other tags inside.

e.g.

<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>

The number of these tags varies, and so does the position of the text relative to them. E.g. <source> text1 <x /> <x/> text2 <x/> text3 </source>. Each x tag is unique, with its own id and attributes.

Is there a way to modify the text inside the tag without having to create a new tag? I was thinking I could extract the x tags and their attributes, but the order of the strings and x tags varies so much from one unit to the next that I'm not sure how to do that. Is there perhaps another package better suited to translating XLIFF files?

Sadra Saderi
Julia
  • In the question, add the expected result for this `<source>`. With BeautifulSoup you would probably have to use a `for`-loop or `list()` to get all the children inside `<source>` and work with them. – furas Feb 09 '21 at 16:23
  • Could you [edit] the question to show what output you want for a given source? – Martin Evans Feb 09 '21 at 17:18
  • There are numerous tools (mostly commercial, some free) that make XLIFF translation a breeze. Try a search for "CAT tools". – Endre Both Feb 10 '21 at 11:12

3 Answers

To extract the two text entries from within <source>, you could use the following approach:

from bs4 import BeautifulSoup

# The trailing backslash joins the two lines, so the string contains no newline.
html = """<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>"""

soup = BeautifulSoup(html, 'lxml')

# stripped_strings yields each text node with surrounding whitespace removed.
print(list(soup.source.stripped_strings))

Giving you:

['Choose your product', 'From a list:']
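
Because XLIFF is XML, the same extraction can also be sketched with the standard library's `xml.etree.ElementTree`, which does not depend on how an HTML parser treats `<source>`. This is only a minimal sketch: the variable names are illustrative, and the snippet omits the XML namespace that a real XLIFF file usually declares.

```python
import xml.etree.ElementTree as ET

# Same snippet as above, written as one XML string (no namespace, for brevity).
source_xml = (
    '<source>'
    '<x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>'
    'Choose your product'
    '<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>'
    'From a list: '
    '</source>'
)

source = ET.fromstring(source_xml)

# itertext() walks the element in document order and yields every text piece,
# including the text that follows each <x/> tag (stored in that tag's .tail).
texts = [piece.strip() for piece in source.itertext() if piece.strip()]
print(texts)
```

This should print the same two strings as the BeautifulSoup version.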
Martin Evans

You can use a for-loop to go through all the children of source.
You can duplicate each one with copy.copy(child) and append the copy to target.
At the same time, you can check whether the child is a NavigableString and translate it.


text = '''<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>'''

conversions = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list: ': 'Z listy: ',
}

from bs4 import BeautifulSoup as BS
from bs4.element import NavigableString
import copy

#soup = BS(text, 'html.parser')  # fails to parse this snippet correctly
#soup = BS(text, 'html5lib')     # fails to parse this snippet correctly
soup = BS(text, 'lxml')

# create `<target>`
target = soup.new_tag('target')

# add `<target>` after `<source>`
source = soup.find('source')
source.insert_after(target)

# work with children in `<source>`
for child in source:
    print('type:', type(child))

    # duplicate child and add to `<target>`
    child = copy.copy(child)
    target.append(child)

    # convert text and replace in child in `<target>`        
    if isinstance(child, NavigableString):
        new_text = conversions[child.string]
        child.string.replace_with(new_text)

print('--- target ---')
print(target)
print('--- source ---')
print(source)
print('--- soup ---')
print(soup)

Result (slightly reformatted to make it more readable):

type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>
type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>

--- target ---

<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>

--- source ---

<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>

--- soup ---

<html><body>
<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>
<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>
</body></html>
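
The loop above copies each child one at a time. A similar idea can be sketched with the standard library's `xml.etree.ElementTree`, where the text after each `<x/>` tag lives in that tag's `.tail`, so the whole `<source>` can be deep-copied and only its text pieces rewritten. Everything named here is an assumption for illustration: the `demo` unit id, the `TRANSLATIONS` lookup (a stand-in for a real translation call such as googletrans), and the helpers `translate` and `add_target`. Real XLIFF files usually declare a namespace, which this sketch leaves out.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical trans-unit; the id "demo" is made up for this example.
XLIFF_UNIT = (
    '<trans-unit id="demo">'
    '<source>'
    '<x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>'
    'Choose your product'
    '<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>'
    'From a list: '
    '</source>'
    '</trans-unit>'
)

# Stand-in for a real translation service (assumption, not a real API).
TRANSLATIONS = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list:': 'Z listy:',
}


def translate(text):
    """Translate the stripped text, keeping leading/trailing whitespace."""
    if text is None or not text.strip():
        return text
    lead = text[:len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    core = text.strip()
    # Unknown strings are left untranslated rather than raising an error.
    return lead + TRANSLATIONS.get(core, core) + trail


def add_target(unit):
    """Insert a translated <target> right after <source> and return it."""
    source = unit.find('source')
    target = copy.deepcopy(source)  # keeps every <x/> tag and its attributes
    target.tag = 'target'
    for elem in target.iter():
        elem.text = translate(elem.text)
        if elem is not target:
            # Text that follows an inline tag is stored in its .tail.
            elem.tail = translate(elem.tail)
    position = list(unit).index(source)
    unit.insert(position + 1, target)
    return target


unit = ET.fromstring(XLIFF_UNIT)
target = add_target(unit)
print(ET.tostring(unit, encoding='unicode'))
```

Because the `<x/>` elements are copied rather than rebuilt, their order and attributes stay exactly as in `<source>`, however many of them a unit contains. Each text piece is still translated on its own, so a sentence split across several tags may lose context.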
furas

I would recommend against parsing XLIFF files with a generic XML parser. Instead, look for a specialized XLIFF toolkit. There are a few Python projects around, but I have no experience with them (I am mostly a Java guy).

martin_wun
  • Not sure if you're running on macOS, but I built a free-to-try XLIFF tool that automatically translates XLIFF using the Google Translation API, and it might be a great no-code solution. As a developer I get the instinct to write everything myself, but sometimes it's best to run with something that's specially designed for the job :) You can download it [here](https://www.xlifflocalizer.com)! – user1781697 Apr 13 '22 at 04:27