I've been tasked with taking a Chinese translation of some English HTML and re-styling it to match the original English HTML. The Chinese "HTML" no longer has any of the original divs or styling from the English version. The Chinese files are encoded in GB2312.
I want to create a program/script to automate this since there are 182 HTML files that need re-styling and I don't want to do it by hand. I'm most familiar with PHP but am open to anything.
Here is one of the English HTML files
Here is the equivalent Chinese HTML file
As you can see, they're very different. If this were only a couple of files, I'd just copy the Chinese characters and paste them into the matching div, replacing the English text at the same time. Then I'd change the encoding to GB2312 in the <head> so that the Chinese characters display properly, e.g.:
<meta charset="gb2312">
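On the encoding side, here is a minimal Python sketch (Python rather than PHP, purely as an illustration) of reading GB2312 bytes, locating the runs of Chinese characters, and writing text back out as GB2312. The function names are hypothetical, and the character range used is an assumption: it covers the common CJK Unified Ideographs block only, so Chinese punctuation and full-width symbols would need extra ranges.

```python
import re

# Assumption: "a string of Chinese" means a run of characters from the
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def find_chinese_runs(raw, encoding="gb2312"):
    """Decode raw bytes and return each run of Chinese characters, in order."""
    # errors="replace" keeps going if a byte sequence is not valid GB2312.
    text = raw.decode(encoding, errors="replace")
    return CJK_RUN.findall(text)


def to_gb2312_bytes(text):
    """Encode text as GB2312 for saving to disk."""
    # Characters that GB2312 cannot represent are written as numeric
    # HTML character references instead of raising an error.
    return text.encode("gb2312", errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "<p>你好, world 世界</p>".encode("gb2312")
    print(find_chinese_runs(sample))
```

Once the bytes are decoded, the script works with ordinary Unicode strings, so the source encoding only matters when reading and writing the files.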
My idea for converting the two is to parse the Chinese file, find each independent string of Chinese, and store each string in its own variable. Then I'd parse the equivalent English file, locate the strings of English text, and replace each one with the matching Chinese string from its variable, with exceptions for ® and ©.
Does anyone know how I might begin to do this? Do most scripting languages even support finding non-UTF-8 characters?
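The parse-and-replace idea described above could be sketched roughly as follows, using only Python's standard `html.parser`. This is a sketch under strong assumptions, not a tested solution for the real files: the class and function names are hypothetical, it assumes the Chinese strings appear in the same order and number as the English ones, and it treats any text node without ASCII letters (such as a lone ® or ©) as something to leave alone.

```python
from html import escape
from html.parser import HTMLParser

RAW_TAGS = ("script", "style")  # their contents are never treated as text


class TextCollector(HTMLParser):
    """Collects the non-blank text nodes of a document, in order."""

    def __init__(self):
        super().__init__()
        self.strings = []
        self._raw_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TAGS:
            self._raw_depth += 1

    def handle_endtag(self, tag):
        if tag in RAW_TAGS and self._raw_depth:
            self._raw_depth -= 1

    def handle_data(self, data):
        if not self._raw_depth and data.strip():
            self.strings.append(data.strip())


class Restyler(HTMLParser):
    """Re-emits the English markup, swapping text nodes for replacements."""

    def __init__(self, replacements):
        super().__init__()
        self.replacements = list(replacements)
        self.out = []
        self._raw_depth = 0

    def handle_decl(self, decl):
        self.out.append("<!" + decl + ">")

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_starttag(self, tag, attrs):
        if tag in RAW_TAGS:
            self._raw_depth += 1
        self.out.append(self.get_starttag_text())  # original tag, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in RAW_TAGS and self._raw_depth:
            self._raw_depth -= 1
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self._raw_depth:
            self.out.append(data)  # script/style content passes through
            return
        has_english = any(c.isascii() and c.isalpha() for c in data)
        if not has_english or not self.replacements:
            # Whitespace, lone symbols such as ® and ©, or no strings left.
            self.out.append(escape(data, quote=False))
            return
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        text = self.replacements.pop(0)
        self.out.append(lead + escape(text, quote=False) + trail)


def restyle(english_html, chinese_html):
    """Return the English markup with its text replaced by the Chinese text."""
    collector = TextCollector()
    collector.feed(chinese_html)
    collector.close()
    restyler = Restyler(collector.strings)
    restyler.feed(english_html)
    restyler.close()
    return "".join(restyler.out)
```

A quick check with made-up markup: `restyle('<div class="box"><h1>Hello</h1><p>&copy;</p></div>', '<h1>你好</h1>')` keeps the `div` and its class, replaces `Hello` with `你好`, and leaves the copyright symbol in place. Pairing strings purely by position is fragile, so any file where the two documents have a different number of text nodes would need checking by hand.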