1

Since SQLite FTS4/FTS5 tokenizer=unicode61 gives us:

a=A=ą=Ą=ä=Ä ...
z=ż=ź=Z=Ż=Ź=Ž=ž ...
etc...

Why not l=ł=L=Ł ??? Isn't it a bug?

How to query SQLite on keybord not having Polish chars ł/Ł? For example querying for name Żabczyński like "zabczynski" - got result, but for name Włast like "wlast" - 0 result (should be like hundreds...) I have my walkaround in PHP, but it does not working with words with l and ł in it, like 'opłacalny'.

<?
$q = $_POST["q"];
//
$pat = '/(\b\w*[lł]\w*\b)/iu';
    $q = preg_replace_callback($pat,function($macz){
        return "(" . str_replace("ł","l",$macz[1]) . "* OR " . str_replace("l","ł",$macz[1]) . "*)";
    },$q);
// so query 'andrzej wlast' looks 'andrzej (wlast* OR włast*)'
...
    $sql = "SELECT ...";
    $pdo = $db->prepare($sql);
    //
    $pdo->execute([":q" => "$q*"]);
    //
    $odp = $pdo->fetchAll(PDO::FETCH_ASSOC);
?>

Any idea? You can't set encoding in sqlite like utf8_general_ci, utf8_polish_ci, utf8_unicode_ci... Or yes, it's possible?

Is there a way to solve it in Python? No ICU on platform (shared server).

crooner
  • 69
  • 2
  • 8

2 Answers2

1

Unfortunately, no, SQLite doesn't have the collating tables like MySQL, because it would bloat what is supposed to be a very small and portable library.

You can transform your queries into something like this:

SELECT * FROM foo WHERE word REGEXP '^[ZŻ]abczy[nń]ski$';
SELECT * FROM foo WHERE word REGEXP '^W[lł]ast$';

It is quite easy in Python:

def collatify(string, equivalents):
    for original, replacement in equivalents.items():
        string = string.replace(original, '[%s%s]' % (original, replacement))
    return string

collatify('Żabczyński', { "Ż": "Z", "ń": "n" })

Again unfortunately, this will make it impossible to use indices for search on these fields.

A better approach is to do the opposite operation, "asciify" your strings, and enter them into the database as an additional column (with its own index!); then "asciify" your query, and watch it work. Even better, see if your "asciified" query is the same as the original; if it is, use the "asciified" column (as the user entered ASCII characters only); if they differ, then the user entered Polish-specific characters, and would presumably enter them all correctly, so use the original column. This way, if the user enters "Żabczyński", you search for "Żabczyński" in the original column and find it there. If the user enters "Zabczynski", assume it might be asciified, and search in the asciified column; it would find "Żabczyński", "Zabczyński", "Żabczynski" and "Zabczynski" if they were there. If the user enters "Zabczyński" or "Żabczynski", presumably they should know Polish, so search in the original column and return no results. All this win comes at the expense of only storing one more copy of your column.

Amadan
  • 191,408
  • 23
  • 240
  • 301
0

Move to MySQL or Postgres. SQLite has its limits.

Michał Leon
  • 2,108
  • 1
  • 15
  • 15