8

I have a table in PostgreSQL with words, but some words have invalid UTF-8 chars like 0xe7e36f and 0xefbfbd.

How I can identify all chars inside words that are invalid and replace they with some symbol like ??

EDIT: My database is in UTF-8, but I think there are double encoding from various other encodings. I think this because when I tried to convert to one type as LATIN1, I get an error saying that some char don't exists in that encoding, when I change to LATIN2 I get the same error, but with another character.

So, what is possible to do to solve this?

Renato Dinhani
  • 35,057
  • 55
  • 139
  • 199
  • 1
    Could you pull out the raw bytes and run them through [iconv](http://www.gnu.org/s/libiconv/)? Iconv has quite a few options for dealing with confused encodings. – mu is too short Nov 18 '11 at 17:24
  • 1
    That's what I would do. Chances are big that the faulty rows were a result of double encoding. Things gets messy when the encoding varies on a per-row basis. In most cases, my first step would be to dump to ascii and sort it out. – wildplasser Nov 18 '11 at 17:53
  • @wildplasser Look at my edit, I think is a problem of double encoding. – Renato Dinhani Nov 18 '11 at 18:30
  • @muistooshort Look at my edit. Looks like double encoding. – Renato Dinhani Nov 18 '11 at 18:30
  • 1
    I can't think of a way for Postgres to have allowed you to insert invalid UTF-8 sequences in an UTF-8 enocoded DB – Milen A. Radev Nov 18 '11 at 19:04
  • @MilenA.Radev Maybe it don't allowed, but the content comes from the web. I think the original content can be wrong encoded and I'm trying to revert this. – Renato Dinhani Nov 18 '11 at 19:06
  • The whole idea of Unicode (and Unicode encodings as UTF-8) is to store different scripts, which is not possible with traditional 8-bit encodings like LATIN1, LATIN2 etc. So it's completely possible to have some symbols in UTF-8 that are not LATIN1 and another symbols that are not in LATIN2 and even others that are not in KOI8-R etc. – Milen A. Radev Nov 18 '11 at 19:07
  • Let me clarify (sorry about that) - the point of my previous 2 comments is to point out that: on one hand you can't have invalid UTF-8 sequences in UTF-8 encoded DB, and on the other - yes, most probably you have texts in different scripts in your UTF-8 encoded DB that can't be converted as a whole to 8-bit encodings as LATIN1 etc. – Milen A. Radev Nov 18 '11 at 19:15
  • @MilenA.Radev Yes, I think is happening the situation in yout last comment. I think I will need write some procedure to solve this because with a single query will always have erros because the original encoding varies. – Renato Dinhani Nov 18 '11 at 19:19
  • You may avoid converting your DB contents to LATIN1, LATIN2 etc. and show it as UTF-8 (or whatever Unicode encoding your client system supports). That's easy on the web (all modern browsers and client OSs support Unicode and various scripts) but may be not so easy in your case. – Milen A. Radev Nov 18 '11 at 19:24
  • @MilenA.Radev I tried to provide some solution. At least, works for me. Look at the answer. – Renato Dinhani Nov 19 '11 at 04:26

1 Answers1

3

Usage

It's a solution for my specific case, but maybe with some modifications can help another people.

Usage

SELECT fix_wrong_encoding('LATIN1');

Function

-- Convert words with wrong encoding
CREATE OR REPLACE FUNCTION fix_wrong_encoding(encoding_name VARCHAR)
RETURNS VOID
AS $$
DECLARE     
    r RECORD;
    counter INTEGER;
    token_id INTEGER;
BEGIN
    counter = 0;
    FOR r IN SELECT t.id, t.text FROM token t
    LOOP
        BEGIN
            RAISE NOTICE 'Converting %', r.text;
            r.text := convert_from(convert_to(r.text,encoding_name),'UTF8');
            RAISE NOTICE 'Converted to %', r.text;
            RAISE NOTICE 'Checking existence.';
            SELECT id INTO token_id FROM token WHERE text = r.text;             
            IF (token_id IS NOT NULL) THEN
                BEGIN
                    RAISE NOTICE 'Token already exists. Updating ids in textblockhastoken';
                    IF(token_id = r.id) THEN
                        RAISE NOTICE 'Token is the same.';
                        CONTINUE;
                    END IF;
                    UPDATE textblockhastoken SET tokenid = token_id
                    WHERE tokenid = r.id;
                    RAISE NOTICE 'Removing current token.';
                    DELETE FROM token WHERE id = r.id;
                END;
            ELSE
                BEGIN
                    RAISE NOTICE 'Token don''t exists. Updating text in token';
                    UPDATE token SET text = r.text WHERE id = r.id;
                END;
            END IF;
            EXCEPTION WHEN untranslatable_character THEN
                --do nothing
            WHEN character_not_in_repertoire THEN
                --do nothing
            END;
            counter = counter + 1;
            RAISE NOTICE '% token converted', counter;
    END LOOP;
END
$$
LANGUAGE plpgsql;
Renato Dinhani
  • 35,057
  • 55
  • 139
  • 199