How can I make "euclidean distance calculation" faster in MySQL?

Question

I am creating a face recognition system, but the search is very slow. Can you share how to speed up the search?

It takes about 6 seconds for 100,000 data items.

MySQL

mysql> SHOW VARIABLES LIKE '%version%';
+--------------------------+------------------------------+
| Variable_name            | Value                        |
+--------------------------+------------------------------+
| version                  | 8.0.29                       |
| version_comment          | MySQL Community Server - GPL |
| version_compile_machine  | x86_64                       |
| version_compile_os       | Linux                        |
+--------------------------+------------------------------+

Table

CREATE TABLE `face_feature` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `f1` decimal(9,8) NOT NULL,
  `f2` decimal(9,8) NOT NULL,
  ...
  ...
  `f127` decimal(9,8) NOT NULL,
  `f128` decimal(9,8) NOT NULL,
  PRIMARY KEY (id)
);

Data

mysql> SELECT count(*) FROM face_feature;
+----------+
| count(*) |
+----------+
|   110004 |
+----------+

mysql> SELECT * FROM face_feature LIMIT 1\G;
  id: 1
  f1: -0.07603023
  f2: 0.13605964
  ...
f127: 0.09608927
f128: 0.00082345

SQL

SELECT 
  id,
  sqrt(
    power(f1 - (-0.09077361), 2) +
    power(f2 - (0.10373443), 2) +
    ...
    ...
    power(f127 - (0.0778369), 2) +
    power(f128 - (0.00951046), 2)
  ) as distance
FROM 
  face_feature
ORDER BY 
  distance
LIMIT
  1
;

Result

+----+--------------------+
| id | distance           |
+----+--------------------+
|  1 | 0.3376853491771237 |
+----+--------------------+
1 row in set (6.18 sec)

Update 1:

Changed from decimal(9,8) to float(9,8)

Then, improved from approximately 4sec to 3.26 sec

mysql> desc face_feature;
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | int        | NO   | PRI | NULL    | auto_increment |
| f1    | float(9,8) | NO   |     | NULL    |                |
| f2    | float(9,8) | NO   |     | NULL    |                |
..
| f127  | float(9,8) | NO   |     | NULL    |                |
| f128  | float(9,8) | NO   |     | NULL    |                |
+-------+------------+------+-----+---------+----------------+

Update 2:

Changed from POWER(z, 2) to z*z

Then, the result was changed from 3.26 sec to 4.65 sec

SELECT 
  id,
  sqrt(
    ((f1 - (-0.09077361)) * (f1 - (-0.09077361))) +
    ((f2 - (0.10373443)) * (f2 - (0.10373443))) +
    ((f3 - (0.00798536)) * (f3 - (0.00798536))) +
    ...
    ...
    ((f126 - (0.07803915)) * (f126 - (0.07803915))) +
    ((f127 - (0.0778369)) * (f127 - (0.0778369))) +
    ((f128 - (0.00951046)) * (f128 - (0.00951046))
  ) as distance
FROM 
  face_feature
ORDER BY 
  distance
LIMIT
  1
;

Update 3

I am looking into the usage of MySQL GIS.

How can I migrate from "float" to "points" in MySQL?

Update 4

I'm also looking at PostgreSQL because I can't find a way to handle 128 dimensions in MySQL.

Do you need the `SQRT()`? Depending on what you are using with the value, you can "skip" the `SQRT()` call on "both sides" and the check/result will be the same. — Progman, Apr 29 '22 at 18:03
By using PostGiS (and a proper data format) you could probably increase the speed like 100000 times - just a side comment — Flash Thunder, Apr 30 '22 at 13:02
@FlashThunder Thank you. I'm looking into PostGiS right now. Is it different from "POINTS and a SPATIAL" that Rick James taught me? I am also looking into "POINTS and a SPATIAL". — zono, Apr 30 '22 at 13:14
I found that PostGIS is a spatial database extender for PostgreSQL object-relational database. — zono, Apr 30 '22 at 13:20
@zono yes it is, it has such an operations implemented natively, would need to change the database software and way of storing (that's why I said it's a side comment, not a an answer at all, but it's not that hard), but maybe in your case it's worth it, as it should work really really fast... of course 100000 times was an approximation, but could be even more. — Flash Thunder, Apr 30 '22 at 13:34
@FlashThunder Thank you for your valuable answer. I could choose something other than MySQL. Let me check. — zono, Apr 30 '22 at 13:41
I really don’t understand why you need the distance of more then 100.000 locations exactly. Why don’t you do some kind of pre filtering by just getting the distance by simple pythagoras and then calculate the correct distance for only the result which are in an interesting range? — Thallius, Apr 30 '22 at 13:55
@Thallius Thank you for your comment. The data is the 128 features of the face. As you say, I may be able to shorten the calculation time by narrowing down the dimensions, but I don't know which data (feature) greatly affects the accuracy of face recognition. (Facial recognition experts may know). Therefore, all data is calculated. — zono, Apr 30 '22 at 14:07
Is it 128 features on 1000 different faces, hence about 100K?? — Rick James, Apr 30 '22 at 14:20
If you find the closest feature on one face, would that same feature be the closest on all (or most) of the other faces? — Rick James, Apr 30 '22 at 14:26
@RickJames I'm not sure if I can respond correctly with just text, so I'll attach a link. Compare the corresponding feature values, and if the value is the closest as a whole, it is judged as the same face. https://github.com/ageitgey/face_recognition/issues/238 — zono, Apr 30 '22 at 14:33
@RickJames "SPATIAL is built into MySQL." Thank you. I'm watching a MySQL GIS tutorial video. I'm trying to figure out if I can handle 128 dimensions. — zono, Apr 30 '22 at 14:35
@zono - Hmmm... What's the query? "which face has point 76 closest to "[.0345,.0654]"? Or ""which face has _some_ point closest to "[.0345,.0654]"? Or "which _point_ on face 98765 is closest to "[.0345,.0654]". (I'm having trouble understanding the goal, hence what SQL to write.) — Rick James, Apr 30 '22 at 14:46
@RickJames It might be easier to communicate if I write the whole SQL. Hold on, please. — zono, Apr 30 '22 at 14:51
@RickJames The following SQL will give me the results I expect. Is this the answer you expect? https://gist.github.com/zono/9b7c68964639dfaa8972fa9a67c99f4a — zono, Apr 30 '22 at 14:56
@zono - Thanks. How do you you compensate for (or prevent) a face being enlarged or rotated or shifted? Do you start by doing some kind of "normalization" of the 128 numbers for each of the 100K faces? — Rick James, Apr 30 '22 at 15:22
@zono - Hence, I withdraw SPATIAL from consideration. Most of my Answer still stands. — Rick James, Apr 30 '22 at 15:25
@RickJames Thank you for your reply. Your point is correct. The data registered in the database has already been normalized. The processes are following: Face image file - > correction or normalization (fix face size, etc.) - > feature extraction - > register to DB — zono, Apr 30 '22 at 15:32

Rick James · Accepted Answer · 2022-04-30T15:29:20.610

DECIMAL(9,8) -- that's a lot of significant digits. Do you need that much precision?
FLOAT -- about 7 significant digits; faster arithmetic.
POWER(z, 2) -- probably a lot slower than z*z. (This may be the slowest part.)

SQRT -- In many situations, you can simply work with the squares. In this case:

SELECT SQRT(closest)
    FROM ( SELECT -- leave out SQRT
             ... ORDER BY .. LIMIT 1 )

Here are some other thoughts. They are not necessarily relevant to the query being discussed:

Precise testing -- Beware of comparing for 'equal' Roundoff error is likely to make things unequal unexpectedly. Imprecise measurements add to the issue. If I measure something twice, I might get 1.23456789 one time and 1.23456788 the next time. (Especially at that level of "precision".
Trade complexity vs speed -- Use ABS(a - b) as the distance formula; find the 10 items closest in that way, then use the Euclidean distance to get the 'right' distance.
Break the face into regions. Find which region the item is in, then check only the subset of the 128 points that are in that region. (Being near a boundary -- put some points in two regions.)
Think out of the box -- I'm not familiar with your facial recognition, so I have run out of mathematical tricks.
Switch to POINTs and a SPATIAL index. It may be possible your task orders of magnitude faster. (This is probably not practical for 128-dimensional space.)

How can I make "euclidean distance calculation" faster in MySQL?

1 Answers1

Linked