
I'm stuck trying to find the cause of

invalid byte sequence for encoding "UTF8".

It is a C program using libpq. I'm using PQexecParams to execute the SQL query.

The offending byte sequence is completely random, and sometimes the command even runs fine. I thought I must have a memory allocation issue somewhere, but even if I specify all the parameters as static strings, I still receive the error with a random byte sequence. What's more, the same query with the same parameters runs fine in a small test program, and it even runs fine from other places in the application. So I'm completely stuck. I checked all the usual suspects, such as client_encoding, but could not find the source of the error. What confuses me is that the offending byte sequence is random even though the query parameters don't change. Moreover, when I check the Postgres log, the query and its parameters appear to be correct.
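(For anyone debugging similar symptoms: one quick client-side check is to validate the parameter bytes before they ever reach PQexecParams. Below is a minimal sketch in plain C; `is_valid_utf8` is a hypothetical helper, not part of libpq, and it deliberately skips some overlong-form checks.)

```c
#include <assert.h>

/* Return 1 if s is valid UTF-8, 0 otherwise.
 * Sketch only: checks continuation bytes and lead-byte ranges,
 * but does not reject every overlong 3- or 4-byte form. */
static int is_valid_utf8(const unsigned char *s)
{
    while (*s) {
        if (*s < 0x80) {                       /* 1-byte (ASCII) */
            s += 1;
        } else if ((*s & 0xE0) == 0xC0) {      /* 2-byte lead */
            if (*s < 0xC2 || (s[1] & 0xC0) != 0x80)
                return 0;                      /* overlong or bad continuation */
            s += 2;
        } else if ((*s & 0xF0) == 0xE0) {      /* 3-byte lead */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return 0;
            s += 3;
        } else if ((*s & 0xF8) == 0xF0) {      /* 4-byte lead */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                || (s[3] & 0xC0) != 0x80)
                return 0;
            s += 4;
        } else {
            return 0;  /* lone continuation byte (e.g. 0x80) or invalid lead */
        }
    }
    return 1;
}
```

Running each element of the parameter array through such a check right before the PQexecParams call would show whether the bad byte already exists client-side (pointing at memory corruption) or only appears on the wire.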

I'm trying to update a record in the following table:

CREATE TABLE public.contacts
(
    contactid integer NOT NULL DEFAULT nextval('contacts_contactid_seq'::regclass),
    paperid integer,
    pos character varying(50) COLLATE pg_catalog."default",
    title character varying(10) COLLATE pg_catalog."default",
    firstname character varying(20) COLLATE pg_catalog."default",
    lastname character varying(25) COLLATE pg_catalog."default",
    func character varying(25) COLLATE pg_catalog."default",
    tel1 text COLLATE pg_catalog."default",
    tel2 text COLLATE pg_catalog."default",
    fax1 text COLLATE pg_catalog."default",
    fax2 text COLLATE pg_catalog."default",
    email1 character varying(50) COLLATE pg_catalog."default",
    email2 character varying(50) COLLATE pg_catalog."default",
    maincontact boolean DEFAULT false,
    publdatacontact boolean DEFAULT false,
    invcontact boolean DEFAULT false,
    queries_recipient boolean,
    contact_log text COLLATE pg_catalog."default",
    salesforceid character(18) COLLATE pg_catalog."default",
    fakelastname boolean NOT NULL DEFAULT false,
    CONSTRAINT contacts_pk PRIMARY KEY (contactid),
    CONSTRAINT contacts_paperid_fkey FOREIGN KEY (paperid)
        REFERENCES public.papers (paperid) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE CASCADE
);

Here is the actual code:

    const char *pparams[16] = {
        NULL,
        NULL,
        "1702",
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        "14340"
    };



gchar *query="UPDATE contacts SET Pos=$1::varchar,Title=$2::varchar,PaperID=$3::int,FirstName=$4::varchar,LastName=$5::varchar,Func=$6::varchar,Tel1=$7::text,Fax1=$8::text,Email1=$9::varchar,Tel2=$10::text,Fax2=$11::text,Email2=$12::varchar,MainContact=$13::boolean,PublDataContact=$14::boolean,InvContact=$15::boolean WHERE ContactID=$16::int";

      result = PQexecParams(conn, query, 16, NULL, pparams, ssizes, bbinary, 0);

An excerpt from the Postgres log:

Jan 26 08:40:57 ip-172-16-10-94 postgres[11334]: [113-1] 2021-01-26 09:40:57.505 CET [11334] jira@project-syndicate LOG:  execute <unnamed>: 

    UPDATE contacts SET Pos = $1::varchar, Title = $2::varchar, PaperID = $3::int, FirstName = $4::varchar, LastName = $5::varchar, Func = $6::varchar, Tel1 = $7::text, Fax1 = $8::text, Email1 = $9::varchar, Tel2 = $10::text, Fax2 = $11::text, Email2 = $12::varchar, MainContact = $13::boolean, PublDataContact = $14::boolean, InvContact = $15::boolean WHERE ContactID = $16::int
    Jan 26 08:40:57 ip-172-16-10-94 postgres[11334]: [113-2] 2021-01-26 09:40:57.505 CET [11334] jira@project-syndicate DETAIL:  parameters: $1 = NULL, $2 = NULL, $3 = '1702', $4 = NULL, $5 = NULL, $6 = NULL, $7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = NULL, $12 = NULL, $13 = NULL, $14 =  NULL, $15 = NULL, $16 = '14340'
    Jan 26 08:40:57 ip-172-16-10-94 postgres[11334]: [114-1] 2021-01-26 09:40:57.544 CET [11334] jira@project-syndicate ERROR:  invalid byte sequence for encoding "UTF8": 0x80

Any ideas as to what could be a cause of the error?

jira
  • This means you stored non-Unicode text (Latin 1 perhaps?) to a UTF8 field. There's nothing random about this. When the database or your program tried to read those bytes and convert them to text, it found byte values that are invalid for UTF8 – Panagiotis Kanavos Jan 27 '21 at 08:44
  • It is an update query; it would fail even if it contained only integers for the foreign key. Absolutely the same query with the same parameters works if I extract it to a separate program. I'll add more details to my question. – jira Jan 27 '21 at 08:52
  • Apart from that, you haven't provided any code or data examples so it's impossible to guess what's wrong. Perhaps you have a bug in the program that stores text? Or the program that *reads* the data tried to use UTF8 even though the field uses a different collation? `even I specify all the parameters as static strings` that doesn't say much either - unless you use Unicode literals with the `u8` prefix, or use a `char8_t` array, what you have isn't UTF8 and needs conversion. – Panagiotis Kanavos Jan 27 '21 at 08:53
  • You won't get any encoding errors from queries with integers. Post your actual code and queries. Again, unless you use a Unicode string or character type, you are working with non-Unicode strings. `char` isn't a Unicode type, it's a `anything goes and good luck` type, whose behavior depends on the machine's locale settings. – Panagiotis Kanavos Jan 27 '21 at 08:54
  • The [Character Literals](https://en.cppreference.com/w/cpp/language/character_literal) page shows how you can actually specify Unicode strings and the resulting types. In general, Unicode support in C++ is still a mess and C is even worse. UTF8 support is still half-baked. That's why so many Linux programs require setting `LC_ALL` to UTF8. And why there are so many SO questions from data scientists whose R or Python 2 programs choked when they tried to process eg Russian or Chinese data for the first time – Panagiotis Kanavos Jan 27 '21 at 08:58
  • Added code and table definition – jira Jan 27 '21 at 09:06
  • Is it saying the invalid sequence is 0x0? I get that one randomly as well. Never been able to reproduce it intentionally. – coladict Jan 27 '21 at 09:11
  • Your bug looks a lot like a memory corruption issue. I don't know the exact API of your database library, but `sizeof(NULL)` (which is probably `4` or `8` depending on your platform) listed as the sizes of empty strings definitely looks fishy to me. Double check how you are supposed to pass a "NULL" arguments to `PQexecParams`. – LeGEC Jan 27 '21 at 09:13
  • The bad bytes change on each run. Sometimes it will even pass. – jira Jan 27 '21 at 09:14
  • @coladict if you use `char` you can reproduce this by using any character outside the 7-bit range. `char` doesn't know about encodings and depends on the machine's locale. UTF16 strings use at least two bytes with the first one being `0x00` for the US-ASCII range. `0x80` never occurs in UTF8. If you want to avoid such issues, you need to use the Unicode types. – Panagiotis Kanavos Jan 27 '21 at 09:15
  • The thing is, the same code works if I create a small test application with only the shown code. Never an error. I tried every possible way to specify the parameters. – jira Jan 27 '21 at 09:16
  • @Jira passing strings instead of the actual numeric data is a pretty bad practice to begin with. Imagine trying to pass decimals or dates as strings instead of floats or `DateTime` (in languages that have a date type). Perhaps it's a failing of the library you use, but using `char* pparams` is definitely a bad idea. – Panagiotis Kanavos Jan 27 '21 at 09:17
  • @jira your code works only by chance. Only because you haven't tried to pass floats yet, in which case `3,5` would have to be parsed based on the server's locale. – Panagiotis Kanavos Jan 27 '21 at 09:18
  • Your problem description looks like there is a problem somewhere else in your code. Perhaps a memory management problem somewhere writes garbage over parts of your heap or stack. I'd use a tool like `valgrind` to look for problems. Such bugs can be difficult to find, because the problem often manifests far away from the actual problem. – Laurenz Albe Feb 02 '21 at 10:23
  • @PanagiotisKanavos I get that from a Java project, and I know for sure that if there was a character of code `0x0000` in the value it would have crashed before the query was sent. At one point I even knew where that crash should be if the problem was in the data. – coladict Aug 04 '21 at 14:38
  • @coladict `0x00`/`NUL` is valid in UTF8. You get an exception in Java because Java uses a [Modified UTF8 encoding instead](https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8), replacing `0x00` with `0xC0 0x80`. Strings rarely contain NULs, so this isn't a problem most of the time – Panagiotis Kanavos Aug 05 '21 at 10:35

2 Answers


First of all, you are using Postgres, and when you create tables in Postgres with type character varying you don't have to specify the length; that is what "varying" stands for. It will consume as many bytes as it needs.

It is very possible that the strings you put in your DB are encoded in a way that uses multi-byte characters, so when you try to insert a 26-byte string into a 25-character column, the last byte is not valid UTF-8.

So I suggest you recreate your table, omitting all lengths on the character varying columns, and try again.

Then check your system locale and the DB's locale. I would suggest you create your DB from template0 with a UTF-8 locale from your system that matches your language (e.g. en_US.UTF-8).

Then check your code file's encoding as well (it should be UTF-8), for example with the file utility.

If nothing works, let me know.

I have tested it with the CREATE TABLE you posted, but without the foreign keys, and with the following code:

#include <stdio.h>
#include <libpq-fe.h>
#include <glib.h>

int main() {
    const char conninfo[] = "postgresql://postgres@localhost?port=5432&dbname=libpq_demo";
    PGconn *conn = PQconnectdb(conninfo);
    if (PQstatus(conn) != CONNECTION_OK) {
        printf("Connection to database failed: %s",  PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }
    else {
        printf("%s", "Connection to database succeeded.\n");
    }

    const char* pparams[16] = {
        NULL,
        NULL,
        "1702",
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        NULL,
        "14340"
    };

    int ssizes[16] = {
        sizeof(NULL),
        sizeof(NULL),
        4,
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        sizeof(NULL),
        5
    };

    int bbinary[16]= {
        1,
        1,
        0,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        1,
        0
    };

    gchar *query="UPDATE contacts SET Pos=$1::varchar,Title=$2::varchar,PaperID=$3::int,FirstName=$4::varchar,LastName=$5::varchar,Func=$6::varchar,Tel1=$7::text,Fax1=$8::text,Email1=$9::varchar,Tel2=$10::text,Fax2=$11::text,Email2=$12::varchar,MainContact=$13::boolean,PublDataContact=$14::boolean,InvContact=$15::boolean WHERE ContactID=$16::int";

    PQexecParams(conn, query, 16, NULL, pparams, ssizes, bbinary, 0);

    PQfinish(conn);
    return 0;
}

and compiled it with

gcc foo.cc -o foo-demo -I/usr/include/postgresql -I/usr/include/glib-2.0 -lpq

Apart from the warning about gchar (I am not sure why you use it, but anyway), everything works perfectly. I have run it about 10,000 times.

You should consider looking at the foreign-key constraint:

CONSTRAINT contacts_paperid_fkey FOREIGN KEY (paperid)
    REFERENCES public.papers (paperid) MATCH SIMPLE

Maybe it has nothing to do with the code, but with the fact that you are passing a value there that conflicts with the constraint.
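If it were the foreign key, libpq would tell you: the server sends a SQLSTATE with every error, readable via PQresultErrorField(res, PG_DIAG_SQLSTATE). A small sketch of how one might map the codes relevant here (`describe_sqlstate` is a hypothetical helper; the five-character codes are standard PostgreSQL SQLSTATEs):

```c
#include <string.h>

/* Map a 5-character SQLSTATE, as returned by
 * PQresultErrorField(res, PG_DIAG_SQLSTATE), to a short label. */
static const char *describe_sqlstate(const char *code)
{
    if (code == NULL)
        return "no error code";
    if (strcmp(code, "22021") == 0)
        return "character_not_in_repertoire (invalid byte sequence)";
    if (strcmp(code, "23503") == 0)
        return "foreign_key_violation";
    return "other";
}
```

A 23503 would point at the constraint above, while the 22021 shown in the question's log confirms that the server really did receive bad bytes.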

Tch
  • `very possible that the strings you put in your db are encoded in such way ex. double byte characters, so when you are trying to insert a 26 bytes string into a 25 length character column the last byte is not valid utf8`. No. `varchar(n)` limits the number of *characters* allowed, not the number of bytes. The latter would be pretty dumb for multi-byte encodings. – Erwin Brandstetter Feb 06 '21 at 14:06

The problem was my blindness. The error was not caused by the query in the question, but by one that ran immediately after it.

jira