Python/PHP SQLite querying Polish letter Ł/ł in FTS4/FTS5

Question

With the SQLite FTS4/FTS5 option tokenize=unicode61 we get:

a=A=ą=Ą=ä=Ä ...
z=ż=ź=Z=Ż=Ź=Ž=ž ...
etc...

Why not l=ł=L=Ł? Isn't that a bug?

How can I query SQLite from a keyboard that has no Polish characters ł/Ł? For example, querying for the name Żabczyński as "zabczynski" returns results, but querying for the name Włast as "wlast" returns 0 results (there should be hundreds). I have a workaround in PHP, but it does not work for words that contain both l and ł, such as 'opłacalny'.

<?php
$q = $_POST["q"];

// Expand every word containing l or ł into both spellings.
$pat = '/(\b\w*[lł]\w*\b)/iu';
$q = preg_replace_callback($pat, function ($macz) {
    return "(" . str_replace("ł", "l", $macz[1]) . "* OR " . str_replace("l", "ł", $macz[1]) . "*)";
}, $q);
// so the query 'andrzej wlast' becomes 'andrzej (wlast* OR włast*)'
...
$sql = "SELECT ...";
$pdo = $db->prepare($sql);
$pdo->execute([":q" => "$q*"]);
$odp = $pdo->fetchAll(PDO::FETCH_ASSOC);
?>

Any ideas? In SQLite you can't set a collation like utf8_general_ci, utf8_polish_ci, utf8_unicode_ci... Or is it possible?

Is there a way to solve it in Python? No ICU on platform (shared server).

Hope so. But I already used a regex. It only works for words whose l's are all 'ł', so 'płakała' = 'plakala' but 'leciał' != 'lecial'. — crooner, Aug 16 '18 at 10:56

Amadan · Answer 1 · 2018-08-16T11:09:54.793

Unfortunately, no, SQLite doesn't have collation tables like MySQL's, because they would bloat what is supposed to be a very small and portable library.

You can transform your queries into something like this:

SELECT * FROM foo WHERE word REGEXP '^[ZŻ]abczy[nń]ski$';
SELECT * FROM foo WHERE word REGEXP '^W[lł]ast$';

It is quite easy in Python:

def collatify(string, equivalents):
    """Turn each key of `equivalents` into a character class with its value."""
    for original, replacement in equivalents.items():
        string = string.replace(original, '[%s%s]' % (original, replacement))
    return string

collatify('Żabczyński', { "Ż": "Z", "ń": "n" })  # '[ŻZ]abczy[ńn]ski'

Unfortunately, this also makes it impossible to use indexes when searching these fields.
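
One caveat for the REGEXP queries above: SQLite itself only parses the operator and, as far as I know, ships no default regexp() function, so the application has to register one. A minimal sketch with Python's sqlite3 module, reusing the foo table and word column from the example (the sample rows are made up for illustration):

```python
import re
import sqlite3


def regexp(pattern, value):
    """Backs `value REGEXP pattern`; SQLite passes the pattern first."""
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
# Register the Python function under the name the REGEXP operator looks up.
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE foo (word TEXT)")
# Illustrative rows only, not real data.
conn.executemany(
    "INSERT INTO foo VALUES (?)", [("Włast",), ("Wlast",), ("Własny",)]
)

rows = conn.execute(
    "SELECT word FROM foo WHERE word REGEXP ?", ("^W[lł]ast$",)
).fetchall()
matches = [row[0] for row in rows]
```

Here matches holds both spellings, "Włast" and "Wlast", but every row is still passed through the Python function, which is why no index can help.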

A better approach is to do the opposite: "asciify" your strings and store them in the database as an additional column (with its own index!), then "asciify" your query and watch it work.

Even better, check whether the "asciified" query is the same as the original. If it is, the user entered ASCII characters only, so search the "asciified" column. If they differ, the user entered Polish-specific characters and would presumably enter them all correctly, so search the original column.

This way, if the user enters "Żabczyński", you search for "Żabczyński" in the original column and find it there. If the user enters "Zabczynski", assume it might be asciified and search the asciified column; it would find "Żabczyński", "Zabczyński", "Żabczynski" and "Zabczynski" if they were there. If the user enters "Zabczyński" or "Żabczynski", presumably they know Polish, so search the original column and return no results.

All of this costs only one extra copy of your column.
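
A minimal sketch of that two-column idea in Python, using only the standard library (so no ICU is needed). The names asciify, search, names and word_ascii are hypothetical, and the sample rows are illustrative. Note that ł/Ł has no Unicode decomposition, so NFKD normalization alone leaves it untouched; the sketch maps it by hand first.

```python
import sqlite3
import unicodedata

# Letters without a Unicode decomposition must be mapped explicitly.
# Only the Polish ł/Ł is listed; extending this table is an assumption.
NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def asciify(text):
    """Fold Polish (and other Latin) diacritics to plain ASCII letters."""
    text = text.translate(NO_DECOMPOSITION)
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(conn, term):
    """Use the folded column for ASCII-only input, else the original."""
    if asciify(term) == term:
        # ASCII-only input: match the folded column, ignoring ASCII case.
        sql = "SELECT word FROM names WHERE word_ascii = ? COLLATE NOCASE"
    else:
        # Input already contains diacritics: match the original column.
        sql = "SELECT word FROM names WHERE word = ?"
    return [row[0] for row in conn.execute(sql, (term,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (word TEXT, word_ascii TEXT)")
conn.execute("CREATE INDEX names_ascii ON names (word_ascii COLLATE NOCASE)")
for word in ["Włast", "Żabczyński", "opłacalny", "leciał"]:
    conn.execute("INSERT INTO names VALUES (?, ?)", (word, asciify(word)))
```

With this, search(conn, "wlast") returns the row for Włast, while search(conn, "Żabczynski") returns nothing because the mixed spelling is looked up in the original column. The same folded text could presumably be stored in an extra FTS column instead of a plain indexed one.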

Thanks, but it's an FTS query. A bunch of lyrics, names and titles. https://staremelodie.pl/ — crooner, Aug 16 '18 at 10:58
Ah right :) Anyway, the second, better approach described in the last paragraph should work transparently and fast with full-text search. — Amadan, Aug 16 '18 at 11:01
@Amadan - Thank you, but yes, I know this solution... I already have two columns, html/non-html, for over 4K songs. Still growing. Not to mention authors, composers... Names, titles... It's gonna be a BIG DATA sqlite db. :) — crooner, Aug 16 '18 at 11:25
Just one letter, one char... What a shame. Must be a Unicode bug, for sure... :( — crooner, Aug 16 '18 at 12:38

Michał Leon · Answer 2 · score 0 · answered Jan 20 '19 at 14:16

Move to MySQL or Postgres. SQLite has its limits.
