Read ansi file and convert to UTF-8 string

Question

Is there any way to do that with PHP?

The data to be inserted looks fine when I print it out.

But when I insert it in the database the field becomes empty.

Try using mysql_real_escape_string() (php.net/manual/en/function.mysql-real-escape-string.php); maybe the string to be inserted contains characters that are used by MySQL. – sikas, Jan 04 '11 at 16:09
I read the strings from the txt file and found that, with mb_detect_encoding($data), some of them return ASCII and some return empty. Any solution? – user192344, Jan 04 '11 at 16:15

Mark Bekkers · Accepted Answer · 2011-01-04T16:06:16.517

14

$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);

or

$tmp = utf8_encode($string);

The strange thing is that you end up with an empty string in your DB. I could understand ending up with some garbage in your DB, but nothing at all (an empty string) is strange.

I just typed this in my console:

iconv -l | grep -i ansi

It showed me:

ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI

These are possible values for YOUR CURRENT CHARSET. As pointed out before, when your input string contains only chars that are already valid UTF-8 (plain ASCII, for example), you don't need to convert anything.

Change UTF-8 to UTF-8//TRANSLIT when you don't want to omit chars but replace them with a lookalike (when they are not in the UTF-8 set).
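As a minimal sketch of the iconv() route, assuming the file was saved by a Western European Windows editor (so the source charset is CP1252): the helper name to_utf8() and the default charset are illustrative assumptions, not something the question confirms. The return value is checked because iconv() returns false when it meets a byte it cannot convert.

```php
<?php
// Hypothetical helper: convert a byte string from an assumed source charset to UTF-8.
// 'CP1252' is only a guess at what "ANSI" means for this particular file.
function to_utf8(string $bytes, string $fromCharset = 'CP1252'): string
{
    // iconv() returns false (and raises a notice) on failure, so check the result
    // instead of passing it straight on to the database.
    $converted = @iconv($fromCharset, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fromCharset to UTF-8");
    }
    return $converted;
}

$ansi = "caf\xE9";           // "café" as CP1252 bytes
$utf8 = to_utf8($ansi);
echo bin2hex($utf8), "\n";   // 636166c3a9
```

For a real file, the input would come from something like file_get_contents($path) rather than a literal string.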

edited Jan 04 '11 at 16:06

answered Jan 04 '11 at 15:50

Mark Bekkers


1

`utf8_encode` converts from ISO 8859-1 to UTF-8. So it can only be used if the input encoding is ISO 8859-1 – Gumbo Jan 04 '11 at 15:51
I tried $data = iconv('ASCII', 'UTF-8', $data); and it output: Message: iconv() [function.iconv]: Detected an illegal character in input string – user192344 Jan 04 '11 at 15:56
ASCII is a subset of UTF-8. If the data were actually ASCII (which it is not, as the error message states) you wouldn't need to convert. – Álvaro González Jan 04 '11 at 15:57
I read the strings from the txt file and found that, with mb_detect_encoding($data), some of them return ASCII and some return empty. Any solution? – user192344 Jan 04 '11 at 16:13
When it returns false, simply open the file and look for garbage by eye. Remove it by hand and try again. If this works, you could write a filter function to run before detecting the encoding. – Mark Bekkers Jan 04 '11 at 16:31

Álvaro González · Answer 2 · 2019-02-19T17:21:02.283

8

"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:

1. Find out what charset the data is using.
2. Use an appropriate function to convert it into UTF-8.

For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.

Update

It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.
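The guess from the user's language can be sketched as a lookup table followed by iconv(). The table below lists a few default Windows code pages; the language keys, the constant name and the helper name are assumptions made for illustration, and the list is far from complete.

```php
<?php
// Assumed mapping from a user's language to the default Windows "ANSI" code page.
// Keys and names are illustrative; extend the table for the territories you serve.
const WINDOWS_ANSI_CODEPAGES = [
    'es' => 'CP1252', // Spanish (Western European)
    'fr' => 'CP1252', // French (Western European)
    'de' => 'CP1252', // German (Western European)
    'pl' => 'CP1250', // Polish (Central European)
    'ru' => 'CP1251', // Russian (Cyrillic)
    'el' => 'CP1253', // Greek
    'tr' => 'CP1254', // Turkish
];

// Hypothetical helper: pick the code page for a language and convert to UTF-8.
function ansi_to_utf8_for_language(string $bytes, string $language): string
{
    $codepage = WINDOWS_ANSI_CODEPAGES[$language] ?? null;
    if ($codepage === null) {
        throw new InvalidArgumentException("No code page known for '$language'");
    }
    $converted = @iconv($codepage, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Input is not valid $codepage");
    }
    return $converted;
}

echo bin2hex(ansi_to_utf8_for_language("a\xF1o", 'es')), "\n"; // 61c3b16f ("año")
```

A wrong guess does not always fail: most byte values are defined in every one of these code pages, so the conversion can succeed and still produce the wrong letters. Checking a sample of the output by eye is still worthwhile.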

edited Feb 19 '19 at 17:21

answered Jan 04 '11 at 15:52

Álvaro González


5

I hate those editors that use the word “ANSI”. It’s similar to incorrectly using “Unicode” for UTF-16. – Gumbo Jan 04 '11 at 15:56
The OP told you about the return values he got from mb_detect_encoding. – Henrik Erlandsson May 26 '14 at 19:21
`mb_detect_encoding()` doesn't really do what most people think. In fact it's close to useless. At most, you can use it to distinguish between UTF-8 and UTF-16, but you need to configure it properly. – Álvaro González Feb 17 '16 at 08:05

score 3 · Answer 3 · answered Dec 04 '13 at 12:03

Be careful: iconv() can return false if the conversion fails.

I am also having a somewhat similar problem: some Chinese characters are mistaken for \n if the file is encoded in Unicode, but not if it is UTF-8.

To get back to your problem, make sure the encoding of your file is the same as that of your database. Also, using utf8_encode() on text that is already UTF-8 can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately this doesn't always work. There is no easy fix for character encoding from what I can see :(
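One way to avoid the double-encoding problem is to test whether the bytes are already valid UTF-8 before converting. This is a sketch that assumes the mbstring extension is loaded; ensure_utf8() is a made-up name, and ISO-8859-1 as the fallback charset is an assumption (it is the charset utf8_encode() converts from). mb_convert_encoding() is used in place of utf8_encode(), which is deprecated in recent PHP versions.

```php
<?php
// Hypothetical helper: convert only when the input is not already valid UTF-8.
// The fallback charset is an assumption about where the data came from.
function ensure_utf8(string $bytes, string $assumedCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedCharset);
}

$already = "caf\xC3\xA9"; // "café" in UTF-8
$latin1  = "caf\xE9";     // "café" in ISO-8859-1

var_dump(ensure_utf8($already) === $already); // bool(true)
var_dump(ensure_utf8($latin1) === $already);  // bool(true)

// Converting unconditionally double-encodes text that was already UTF-8:
echo bin2hex(mb_convert_encoding($already, 'UTF-8', 'ISO-8859-1')), "\n"; // 636166c383c2a9
```

The validity check is a heuristic: a legacy-encoded string can occasionally happen to be valid UTF-8, so it reduces the risk of double-encoding without removing it.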
