Fixing invalid UTF8 characters

Question

I'm importing a text file into an SQLite database and then outputting those values in JSON format using PHP.

json_encode fails, complaining about illegal characters. I tracked it down to the two accented characters in the string terrains à bâtir. This string renders fine when I open the file in Sublime, but in TextEdit it is shown as terrains ‡ b‚tir

Some info about the file and its contents

file -i file.txt tells me text/plain; charset=us-ascii
mb_detect_encoding() on a valid string tells me it is ASCII
mb_detect_encoding() on an invalid string tells me it is UTF-8
hexdump -C file.txt | grep terrains shows the characters as dots:

00a4eb30 7c 74 65 72 72 61 69 6e 73 20 e0 20 62 e2 74 69 ||terrains . b.ti|

cat file.txt | tail -c +1671338 | head -c 20 shows the characters as � and they appear in my sqlite GUI the same way.

ns � b�tir|11111|AAA
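For what it's worth, the hexdump line already shows the two odd bytes: e0 and e2. Neither is valid UTF-8 by itself, because in UTF-8 each of them would have to be followed by continuation bytes, and here they are followed by a space and a "t". Under ISO-8859-1 (and Windows-1252), e0 is à and e2 is â. A minimal sketch to list such bytes, assuming the mbstring extension is loaded; highBytes() is a hypothetical helper name, not something from the original code:

```php
<?php
// Sample bytes copied from the hexdump line above.
$sample = "terrains \xE0 b\xE2tir";

// Hypothetical helper: list every byte outside the 7-bit ASCII range,
// keyed by its offset in the string.
function highBytes(string $s): array
{
    $found = [];
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $byte = ord($s[$i]);
        if ($byte >= 0x80) {
            $found[$i] = sprintf('%02x', $byte);
        }
    }
    return $found;
}

print_r(highBytes($sample));                    // offsets 9 and 12: e0, e2
var_dump(mb_check_encoding($sample, 'UTF-8'));  // bool(false)
```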

I know it's possible to use iconv to "fix" this using the TRANSLIT or IGNORE options but then I end up with something different than what it's supposed to be.

$encoding = mb_detect_encoding($row[2]);
if ($encoding !== 'ASCII') {
    // //IGNORE drops whatever cannot be represented in ASCII, so the accents are lost.
    $converted = iconv('UTF-8', 'ASCII//IGNORE', $row[2]);
    print_r($converted);
}

Using IGNORE (obviously) outputs terrains btir, and with TRANSLIT iconv complains: iconv(): Detected an illegal character in input string

My goal is to revert these characters to their proper accented form, using PHP. How can I do this? I'm guessing the hexdump output provides some clues, but I can't figure out which bytes are the problematic ones or how to fix them.
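One possible approach, sketched under a loud assumption: any field that is not already valid UTF-8 is taken to be ISO-8859-1, which is consistent with the e0 and e2 bytes in the hexdump but is not confirmed for the whole file. The helper name toUtf8() is hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: return $value as valid UTF-8.
// ASSUMPTION: a field that is not valid UTF-8 is encoded as ISO-8859-1.
function toUtf8(string $value): string
{
    // Already valid UTF-8 (this includes plain ASCII): leave it untouched.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise reinterpret every byte as an ISO-8859-1 character.
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// The bytes from the hexdump: 0xE0 and 0xE2 stand for à and â.
$fixed = toUtf8("terrains \xE0 b\xE2tir");

// JSON_UNESCAPED_UNICODE keeps the accents readable in the output.
echo json_encode($fixed, JSON_UNESCAPED_UNICODE), PHP_EOL;
```

Applied per field before json_encode (for example toUtf8($row[2])), this leaves correctly encoded rows alone. It will not help if a single field mixes both encodings, since the whole field would then be reinterpreted as ISO-8859-1.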

Guessing and programs/functions that guess will not lead to success. Start with, what is the character encoding of the file? (It is not ASCII if you say the proper interpretation of the bytes includes à and â.) — Tom Blodget, Jul 17 '18 at 16:54
@TomBlodget file -i file.txt tells me charset=us-ascii. Is there another way I can check? I'm assuming that whoever authored the file (from a govt public dataset) made some mistakes wrt encoding — stef, Jul 18 '18 at 07:04
`file` saying that means it is compatible with just about every encoding except UTF-16, UTF-32 and EBCDIC. But that means the characters with diacritics are not in the file! Where do they come from? — Tom Blodget, Jul 20 '18 at 10:22
Data's coming from a govt public data set. There are other accents in the file that render fine so it's probably just badly authored. I found a workaround. Post an answer and I'll accept it. Thanks for your help! — stef, Jul 20 '18 at 14:00
I don't know what the answer is. You would be the best person to write it. — Tom Blodget, Jul 20 '18 at 15:21
