
I've been tasked with taking a Chinese translation of English HTML, and re-styling it to match the original English HTML. The Chinese "HTML" no longer has any of the original Divs or styling of the English HTML. The Chinese character encoding is GB2312.

I want to create a program/script to automate this since there are 182 HTML files that need re-styling and I don't want to do it by hand. I'm most familiar with PHP but am open to anything.

Here is one of the English HTML files

Here is the equivalent Chinese HTML file

As you can see, they're very different. If this were only a couple of files, I'd just copy the Chinese characters and paste them into the matching DIV, replacing the English text at the same time. I'd then change the encoding to GB2312 in the <head> so that the Chinese characters display properly, e.g.:

<meta charset="gb2312">

My thought for converting between the two is to parse the Chinese file, find each independent string of Chinese, and store each string in its own variable. I'd then parse the equivalent English file, locate the strings of English text, and replace each one with the matching Chinese string from its variable, adding exceptions for entities such as &reg; and &copy;.
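A minimal sketch of that idea, written in Python with only the standard library's `html.parser`. It assumes the Chinese and English files contain the same number of text nodes in the same order, which may not hold for every file, so it raises an error on a mismatch rather than guessing. The names `TextCollector`, `TextReplacer`, `collect_texts`, and `restyle` are made up for illustration.

```python
import html
from html.parser import HTMLParser

SKIPPED = ("script", "style")  # their contents are code, not translatable text


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of a document, in order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &reg; etc. arrive as characters
        self.texts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.texts.append(data.strip())


class TextReplacer(HTMLParser):
    """Re-emit a document, swapping each non-blank text node for a replacement."""

    def __init__(self, replacements):
        super().__init__(convert_charrefs=True)
        self._replacements = iter(replacements)
        self.out = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip += 1
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip:
            self._skip -= 1
        self.out.append(f"</{tag}>")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        if self._skip or not data.strip():
            self.out.append(data)  # whitespace and script/style bodies pass through
        else:
            self.out.append(html.escape(next(self._replacements), quote=False))


def collect_texts(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.texts


def restyle(english_markup, chinese_markup):
    """Return the English markup with its text nodes replaced by the Chinese ones."""
    chinese = collect_texts(chinese_markup)
    english = collect_texts(english_markup)
    if len(chinese) != len(english):
        raise ValueError(f"{len(english)} English vs {len(chinese)} Chinese text nodes")
    replacer = TextReplacer(chinese)
    replacer.feed(english_markup)
    replacer.close()
    return "".join(replacer.out)


english = '<div class="title"><h1>Welcome</h1><p>Contact us</p></div>'
chinese = '<p>欢迎</p><p>联系我们</p>'
print(restyle(english, chinese))
```

Both inputs are already-decoded strings here. When writing the result back out, `text.encode("gb2312", errors="xmlcharrefreplace")` turns any character the codec cannot represent into a numeric entity, which may remove the need for hand-written entity exceptions. The `<meta charset>` tag still has to be updated separately.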

Does anyone know how I might begin to do this? Do most scripting languages even support finding non-UTF-8 characters?
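On the encoding question, a small Python sketch: once the GB2312 bytes are decoded into a Unicode string, Chinese characters can be matched with an ordinary regular expression. The range `\u4e00-\u9fff` covers only the basic CJK Unified Ideographs block, which is an assumption about what the translated text contains, and `chinese_runs` is a made-up helper name.

```python
import re

# Runs of characters from the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def chinese_runs(raw, encoding="gb2312"):
    """Decode raw bytes and return each contiguous run of Chinese characters."""
    text = raw.decode(encoding, errors="replace")
    return CJK_RUN.findall(text)


# Simulate a GB2312-encoded file by encoding a sample string.
sample = "<p>你好，世界</p><p>版权所有 &copy; 2012</p>".encode("gb2312")
print(chinese_runs(sample))
```

Punctuation such as the full-width comma falls outside that range, so it splits a sentence into separate runs. Widening the character class is one way to keep whole sentences together.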

Mr Lister
lime517

1 Answer


I'm not familiar with PHP, only with C#.

Since I can't see the whole picture (such as the HTML hierarchy of your files and any differences between them), I can only offer general advice.

You can:

  1. Loop through your files.
  2. Use a third-party HTML parser (such as Html Agility Pack) to read them, so you can easily extract whatever you need (text, attributes, patterns) and store it temporarily, as you described.
  3. Use a third-party library (such as NTextCat or Language Detection API) to determine the language of each file (this is really part of step 2, since you only want to parse the HTML files that contain Chinese).
  4. Then choose one of two options:
    1. Find the equivalent English file and replace its text (using the parser from step 2). I guess you'll know better than us how to work out which text should replace which.
    2. Or prepare "MVC-style" templates and use a third-party library (such as RazorEngine) for templating.
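As a rough illustration of steps 1 and 3, here is a Python sketch that walks two directories and pairs files by name. It assumes the English and Chinese files share file names and live in separate folders, which the question does not confirm. The language check is a simple character-ratio heuristic standing in for a real language detection library, and `looks_chinese`, `pair_files`, and the 5% threshold are illustrative choices.

```python
import re
from pathlib import Path

CJK = re.compile(r"[\u4e00-\u9fff]")


def looks_chinese(raw, threshold=0.05):
    """Heuristic: treat the file as Chinese if enough decoded characters are CJK."""
    text = raw.decode("gb2312", errors="ignore")
    if not text:
        return False
    return len(CJK.findall(text)) / len(text) >= threshold


def pair_files(english_dir, chinese_dir):
    """Yield (english_path, chinese_path) for names present in both directories."""
    chinese_dir = Path(chinese_dir)
    for english_path in sorted(Path(english_dir).glob("*.html")):
        chinese_path = chinese_dir / english_path.name
        if chinese_path.is_file() and looks_chinese(chinese_path.read_bytes()):
            yield english_path, chinese_path
```

Each pair can then be handed to whatever does the text replacement in step 4. English files with no Chinese counterpart are skipped silently here, so logging them would be a sensible addition.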

Hope this helps. If you have any questions, feel free to ask :)

neoselcev