2

I am using this collection of Canadian postcodes. To my disappointment, many French names are misencoded, so that "Trois-Rivières" shows up as "Trois-Rivi�Res," and so on. This appears to be a mistake in the database itself rather than a fault with my text editor/browser.

Is there a way to encode these accent markers correctly? There are far too many of them to manually look up and replace.

Ian Martin
  • 95
  • 6
  • In the *general* case, no, we can't guess what correct text some incorrect text is supposed to represent. But in some limited scenarios, you do know enough to recover. For example, if the target system is UTF-8 and some of the incorrect input is in Latin-1, and there are no other possibilities, you can recover the text with relative ease. Please [edit] your question to clarify whether this seems like a feasible assumption. – tripleee Dec 31 '18 at 06:07
  • Public data like this is probably available from an unmangled source, perhaps in a form where the useful lat/long information from this dumpster can be merged with a correct textual representation of the postal codes. – tripleee Dec 31 '18 at 06:11
  • 2
    Quick googling gets me https://www.aggdata.com/free/canada-postal-codes – tripleee Dec 31 '18 at 06:14
  • Your database might have [a way to convert between encodings on the fly](https://www.postgresql.org/docs/8.4/multibyte.html). You might also do the query in a program and then convert with a library such as libiconv. Or you might dump to a format such as CSV, convert that, and then import. – Davislor Dec 31 '18 at 07:29
  • 2
    Unfortunately, looking at the raw downloaded .CSV file from your link, the data literally contains the Unicode code point U+FFFD (REPLACEMENT CHARACTER) encoded in UTF-8 in thousands of entries. The original code points are lost. The original data was probably read with the wrong encoding and encoding errors were replaced with this character, then entered into the database using UTF-8 encoding. – Mark Tolonen Dec 31 '18 at 14:21

2 Answers2

1

The words seem to contain that replacement question mark, '\uFFFD'. Which means data loss, but also a simple marker to find those values.

Make a table with as primary key the wrong string. Repairing might be easier than thought. In the example the capital R after the qm instead of r might stem from word wise capitalisation. i?r is most likely i e-gauche r.

A regexp replace then will do.

However it might be easier to reimport a corrected database dump. The repair does not need to be in SQL.

Joop Eggen
  • 107,315
  • 7
  • 83
  • 138
0

In php you can use the following to remap from ascii (latin 1) encoded text to utf8.

$str = mb_convert_encoding($str,'ISO-8859-1', 'UTF-8');
david-giorgi
  • 145
  • 2
  • 10