Convert non-English characters into Unicode (UTF-8)

Question

I copied a large amount of text from another system to my PC. When I viewed the text on my PC, it looked wrong, so I copied all the fonts from the other PC and installed them on mine too. Now the text looks okay, but it does not actually seem to be Unicode. For example, if I copy the text and paste it into a UTF-8-capable editor such as Notepad++, I get only English characters ("bgah;").

How can I convert this whole text into Unicode text like the one below, so that I can copy the text and paste it anywhere else?

பெயர்

The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil

I need this conversion so that I can copy the text into database tables.

If you can create a table of each character code and which Unicode point it corresponds with, somebody can help you create a program which performs the translation. Until then, this is off-topic for SO. — tripleee, Jan 28 '12 at 17:40
It looks like the problem is a matter of text being in a non-standard character encoding, and has nothing to do with UTF-8. Added tag "character-encoding", removed tag "utf-8". — Jim DeLaHunt, Jan 30 '12 at 02:35

score 5 · Answer 1 · answered Jan 29 '12 at 10:10

'Ja-01' is a font with a custom 'visual encoding'.

That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.

This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil. The approach was nevertheless common in the pre-Unicode days, especially for less widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.

Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. The font does not appear to follow any standard character ordering, so you will have to look at the font (e.g. in charmap.exe), note down every character, find the matching character in Unicode and map between them.

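For example, here's a trivial Python script to replace characters in a file:

# Map each Latin character stored in the file to the Tamil code point
# whose shape the font draws for it.
mapping = {
    u'a': u'\u0BAF',   # Tamil letter Ya
    u'b': u'\u0BAA',   # Tamil letter Pa
    u'g': u'\u0BC6',   # Tamil vowel sign E (combining)
    u'h': u'\u0BB0',   # Tamil letter Ra
    u';': u'\u0BCD',   # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}

# The legacy file is expected to hold only ASCII characters,
# so decoding it as UTF-8 is safe.
with open('ja01data.txt', 'rb') as fp:
    data = fp.read().decode('utf-8')
# Replace every mapped character with its Tamil equivalent.
for char in mapping:
    data = data.replace(char, mapping[char])
# Write the converted text back out as UTF-8.
with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))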
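
One caveat with plain character replacement: visual encodings often store the left-side vowel signs (ெ, ே, ை) before the consonant, because that is where they are drawn, whereas Unicode stores them after the consonant. Whether 'Ja-01' does this has to be checked against the font. The following is a minimal sketch, assuming text that has already been through the replacement step and in which the vowel sign comes first; the function name `to_logical_order` is made up for illustration.

```python
import re

# Tamil vowel signs that are drawn to the left of their consonant: E, EE, AI.
PREBASE_SIGNS = u'\u0BC6\u0BC7\u0BC8'
# Tamil consonants lie in the range U+0B95..U+0BB9 of the Unicode Tamil block.
CONSONANT = u'[\u0B95-\u0BB9]'

_visual_pair = re.compile(u'([' + PREBASE_SIGNS + u'])(' + CONSONANT + u')')


def to_logical_order(text):
    """Move each left-side vowel sign behind the consonant that follows it."""
    return _visual_pair.sub(r'\2\1', text)
```

Apply this only to text known to be in visual order. Text that is already in Unicode order can also contain a vowel sign followed by a consonant (as in பெயர்), and running the swap on it would corrupt it.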

score 4 · Answer 2 · answered Jan 28 '12 at 14:41

The font you found is getting you into trouble. The actual cell text is "bgah;"; it gets rendered as பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;", since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
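
To see the difference for yourself, you can list the code points behind each string. This is a small sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]


# What the file really contains: five Latin/ASCII characters.
for row in describe(u'bgah;'):
    print(row)

# What real Unicode Tamil text for the same word contains.
for row in describe(u'\u0BAA\u0BC6\u0BAF\u0BB0\u0BCD'):
    print(row)
```

The first loop prints only Latin letters and a semicolon; the second prints names from the Tamil block. No font can change which of the two a program sees.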

Ditch the font and enter the text as real Unicode instead.

answered Jan 28 '12 at 14:41

Hans Passant


Yes, you are right. But how do I convert my text from my old font to a Unicode one? – Raj Jan 28 '12 at 14:45
You'll have to convert your text from the old *encoding*. Which doesn't look like TSCII. http://www.tamil.net/tscii/charset17.gif Code page 57004 is Tamil but is not a match either. No idea, ask whomever generated the text. – Hans Passant Jan 28 '12 at 14:59
The text is generated by a 10 year old VB app, we don't have support for the software now. – Raj Jan 28 '12 at 15:11
Clearly I can't help you find the programmer of this app, he's a heckofalot closer to you than me. The only other thing you could do is reverse-engineer the encoding from the font you got. Have it render all possible ASCII codes and see what Tamil glyph you get. – Hans Passant Jan 28 '12 at 15:14
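
The reverse-engineering step suggested in the last comment can be prepared with a few lines of Python. This is only a sketch: it writes every printable ASCII character next to its hex code, and you then open the file in an editor set to the legacy font and note which Tamil glyph appears on each line. The file name `glyph_chart.txt` and the function name `ascii_chart` are made up for illustration.

```python
def ascii_chart():
    """One line per printable ASCII character: hex code, then the character."""
    return ['0x%02X  %s' % (code, chr(code)) for code in range(0x21, 0x7F)]


if __name__ == '__main__':
    # Hypothetical output file; view it with the legacy font selected.
    with open('glyph_chart.txt', 'w') as fp:
        fp.write('\n'.join(ascii_chart()) + '\n')
```

If the font also defines glyphs above 0x7F, the range would need to be extended and the file written in a matching 8-bit encoding; that is not covered here.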

score 2 · Accepted Answer · answered Jan 25 '13 at 16:57

"bgah" looks like a Baamini-based system, which is pre-Unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.

As the others mentioned, it looks like a custom visual encoding that mimics the appearance of a foreign script while maintaining ASCII encoding.

Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml

Let me know if this works. If not, I can ask around and get something for you.

answered Jan 25 '13 at 16:57

Ashwin Balamohan

Hey that was super cool.. Worked perfectly.. had issue only with "ர்" which I corrected manually. – Raj Jan 26 '13 at 07:12
I'm glad it was helpful! Yeah, the ர் can be ambiguous. When most people write it (i.e. on paper), it looks like an 'aravu' (ா, but without that circle) with a 'pulli' ( the dot, ் - also without the circle) on top of it. That looks like how 'peyar' (பெயர்) was written in the text you posted. – Ashwin Balamohan Jan 27 '13 at 02:46
Riyafa, try this site: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/t1.html – Ashwin Balamohan Apr 02 '16 at 19:33
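
If the remaining problem with "ர்" is that the converter emitted the vowel sign ா followed by a pulli (an assumption; the thread does not show the converter's actual output), it can be repaired in one pass. A pulli belongs on a consonant, so the sequence ா + ் is not meaningful Tamil and can be replaced with ர + ். The function name `fix_ra_pulli` is made up for illustration.

```python
AA_SIGN = u'\u0BBE'   # Tamil vowel sign AA
RA = u'\u0BB0'        # Tamil letter RA
PULLI = u'\u0BCD'     # Tamil sign virama (pulli)


def fix_ra_pulli(text):
    """Replace vowel sign AA + pulli with letter RA + pulli."""
    return text.replace(AA_SIGN + PULLI, RA + PULLI)
```

This only handles the case where a pulli follows. An ambiguous shape without a pulli cannot be resolved by such a simple rule and still needs a manual check.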

score 0 · Answer 4 · answered Jan 28 '12 at 14:21

You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode”, but they say that only TSCII is supported for now.

answered Jan 28 '12 at 14:21

Jukka K. Korpela


I tried this - Copied "bgah;" and put it in IN.txt and ran "ascii2unicode.exe" via commandline. But the program closes unexpectedly. I tried with Windows XP compatibility mode. Still it crashes. – Raj Jan 28 '12 at 14:43
The TSCII reference documents indicate that the encoding is compatible with ASCII; the Tamil characters all have the 8th bit set. Thus it seems that the fonts use another encoding. – tripleee Jan 28 '12 at 17:29
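
The observation in the last comment can be checked mechanically. A minimal sketch, with a made-up helper name `high_bit_bytes`, that lists which byte values of 0x80 or above occur in the raw data:

```python
def high_bit_bytes(raw):
    """Return the sorted distinct byte values >= 0x80 found in raw bytes."""
    return sorted({b for b in bytearray(raw) if b >= 0x80})
```

To use it, read the data in binary mode, for example `high_bit_bytes(open('ja01data.txt', 'rb').read())`. An empty result means the data is pure 7-bit ASCII, so it cannot contain TSCII-encoded Tamil characters.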
