I am trying to convert a MySQL UTF8mb4 database which contains both Thai and English to Postgresql. This appears to go well until I try and add tsearch. Let me outline the steps taken.
- Install this Thai parser https://github.com/zdk/pg-search-thai
- I restore a copy of production locally from a dump file into MariaDB
- Fix some enum values that trip up Postgresql due to them being missing. MariaDB is happy with them :(
Convert some polygons to text format as pgloader does not deal with them gracefully. -Run pgloader against a fresh postgresql database, testdb
pgloader mysql://$MYSQL_USER:$MYSQL_PASSWORD@localhost/$MYSQL_DB postgresql://$PG_USER:$PG_PASSWORD@localhost/testdb
This appears to work, the site, a Laravel one, appears to function although with some bugs to fix due to differences between MariaDB and Postgresql constraint behavior. However when I try and create text vectors for tsearch, I run into encoding issues. This is where I need advice.
-- trying to create minimal case, dumping Thai names into a temporary table
CREATE EXTENSION thai_parser;
CREATE TEXT SEARCH CONFIGURATION thai_unstemmed (PARSER = thai_parser);
ALTER TEXT SEARCH CONFIGURATION thai_unstemmed ADD MAPPING FOR a WITH simple;
-- to test the parser is working, which it is
SELECT to_tsvector('thai_unstemmed', 'ข้าวเหนียวส้มตำไก่ย่าง ต้มยำกุ้ง in thailand');
-- to recreate my error I did this
CREATE TABLE vendor_names AS SELECT id,name from vendors_i18n;
ALTER TABLE vendor_names ADD COLUMN tsv_name_th tsvector;
-- this fails
UPDATE vendor_names SET tsv_name_th=to_tsvector('thai_unstemmed', coalesce(name, ''));
The error I get is ERROR: invalid byte sequence for encoding "UTF8": 0x80
If I dump that table and restore into a new Postgresql database I do not get the encoding error.
Questions:
- What is the correct encoding to use for UTF8mb4 to Postgresql for pgloader?
- Is there any way, other than the above, of checking the data being correct UTF8 or not?
- Is the problem in the Thai parser tool?
Any suggestions as to how to solve this would be appreciated.
Cheers,
Gordon
PS I'm an experienced developer but not an experienced DBA.