
I am trying to convert a MySQL utf8mb4 database, which contains both Thai and English, to PostgreSQL. This appears to go well until I try to add tsearch. Let me outline the steps taken.

  • Install this Thai parser: https://github.com/zdk/pg-search-thai
  • Restore a copy of production locally from a dump file into MariaDB.
  • Fix some enum values that trip up PostgreSQL because they are missing; MariaDB is happy with them :(
  • Convert some polygons to text format, as pgloader does not handle them gracefully.
  • Run pgloader against a fresh PostgreSQL database, testdb:

    pgloader mysql://$MYSQL_USER:$MYSQL_PASSWORD@localhost/$MYSQL_DB postgresql://$PG_USER:$PG_PASSWORD@localhost/testdb

This appears to work: the site (a Laravel application) functions, albeit with some bugs to fix due to differences in constraint behavior between MariaDB and PostgreSQL. However, when I try to create text vectors for tsearch, I run into encoding issues. This is where I need advice.

-- trying to create minimal case, dumping Thai names into a temporary table
CREATE EXTENSION thai_parser;
CREATE TEXT SEARCH CONFIGURATION thai_unstemmed (PARSER = thai_parser);
ALTER TEXT SEARCH CONFIGURATION thai_unstemmed ADD MAPPING FOR a WITH simple;

-- to test the parser is working, which it is
SELECT to_tsvector('thai_unstemmed', 'ข้าวเหนียวส้มตำไก่ย่าง ต้มยำกุ้ง in thailand');

-- to recreate my error I did this
CREATE TABLE vendor_names AS SELECT id,name from vendors_i18n;
ALTER TABLE vendor_names ADD COLUMN tsv_name_th tsvector;

-- this fails
UPDATE vendor_names SET tsv_name_th=to_tsvector('thai_unstemmed', coalesce(name, ''));

The error I get is:

    ERROR: invalid byte sequence for encoding "UTF8": 0x80
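For context on the byte itself (a general sketch, not specific to pgloader or the Thai parser): 0x80 is a UTF-8 continuation byte, so on its own it can never begin a character and any strict decoder rejects it, which is exactly what PostgreSQL is doing here. A minimal Python check:

```python
# 0x80 is a UTF-8 continuation byte; standing alone it is an invalid
# sequence, which is what PostgreSQL's UTF8 validation rejects.
def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("ต้มยำกุ้ง".encode("utf-8")))  # True: well-formed Thai
print(is_valid_utf8(b"\x80"))                      # False: stray continuation byte
```

Note that 0x80 is a printable character in several single-byte encodings (e.g. the Euro sign in Windows-1252), so a byte like this often indicates data that was stored or exported in a non-UTF-8 encoding somewhere along the pipeline.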

If I dump that table and restore it into a new PostgreSQL database, I do not get the encoding error.

Questions:

  • What is the correct encoding to specify when moving utf8mb4 data to PostgreSQL with pgloader?
  • Is there any way, other than the above, of checking whether the data is valid UTF-8?
  • Is the problem in the Thai parser tool?
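On the second question, one way to check the data outside the database (a hedged sketch: it assumes you can export the suspect column to a file, e.g. with `COPY ... TO`, and `find_bad_lines` is a hypothetical helper name):

```python
# Hedged sketch: scan a raw table export line by line and report where
# invalid UTF-8 occurs. Each entry is (line number, byte offset within
# the line, offending byte).
def find_bad_lines(path):
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                bad.append((lineno, exc.start, raw[exc.start:exc.start + 1]))
    return bad
```

Running this over a dump of the `name` column would at least tell you whether the bad bytes are in the stored data or are being produced later, e.g. inside the parser.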

Any suggestions as to how to solve this would be appreciated.

Cheers,

Gordon

PS I'm an experienced developer but not an experienced DBA.


1 Answer


Have you tried manually importing the dataset row by row to see which rows import successfully and which fail? If some imports succeed and others fail, the problem is likely bad data in specific rows.

If none of the records import successfully, it is more likely an encoding-configuration problem.
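The row-by-row idea can be mechanized with a small driver (a sketch; `check` is a function you would supply, e.g. one that runs the `UPDATE ... to_tsvector(...)` against a single `id` and reports whether it succeeded):

```python
# Hedged sketch: walk rows with a caller-supplied check and collect the
# failures, so the offending data can be inspected directly.
def failing_rows(rows, check):
    return [row for row in rows if not check(row)]

# Illustration with a stand-in check: flag rows whose raw name bytes
# are not valid UTF-8 (no database connection involved).
rows = [(1, "ไก่ย่าง".encode("utf-8")), (2, b"bad\x80name")]

def utf8_ok(row):
    try:
        row[1].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(failing_rows(rows, utf8_ok))  # [(2, b'bad\x80name')]
```

If only a handful of rows come back, that points at corrupt source data rather than a wrong encoding setting in pgloader or the parser.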