Can I use InnoDB full text search to tokenize documents?

Question

I used to think that SQL cannot process unstructured data (like text) unless we write some user-defined functions in C. However, InnoDB's FullText Search feature seems did much of such work already.

According to https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html, the index is saved in InnoDB tables named FTS_00000..._00000..._INDEX_?.

I tried to run SELECT * FROM FTS_00000..._00000..._INDEX_1, in the hope to see tokens in each document (perhaps with stopwords removed already). However, I got an error message

ERROR 1146 (42S02): Table 'tf.FTS_0000000000000028_0000000000000030_INDEX_1' doesn't exist

even if select * from information_schema.INNODB_SYS_TABLES; reveals that the table exists.

Does anyone know how I could get the tokens of each document I inserted into the full-text index? It would be great if I can get the information in the following data schema:

token_id   document_id   count
"apple"    103343        3
"orange"    9593         1
...

score 0 · Answer 1 · answered Feb 02 '20 at 18:49

Just because InnoDB uses a table as an internal data structure doesn't mean you have access to query those FTS tables with SQL statements. They don't appear in INFORMATION_SCHEMA.TABLES.

After creating the table opening_lines which is the example given in that manual page, I see this:

mysql> SELECT table_id, name, space FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
    ->        WHERE name LIKE 'test/%';
+----------+----------------------------------------------------+-------+
| table_id | name                                               | space |
+----------+----------------------------------------------------+-------+
|       52 | test/FTS_000000000000002e_0000000000000085_INDEX_1 |    36 |
|       53 | test/FTS_000000000000002e_0000000000000085_INDEX_2 |    37 |
|       54 | test/FTS_000000000000002e_0000000000000085_INDEX_3 |    38 |
|       55 | test/FTS_000000000000002e_0000000000000085_INDEX_4 |    39 |
|       56 | test/FTS_000000000000002e_0000000000000085_INDEX_5 |    40 |
|       57 | test/FTS_000000000000002e_0000000000000085_INDEX_6 |    41 |
|       47 | test/FTS_000000000000002e_BEING_DELETED            |    31 |
|       48 | test/FTS_000000000000002e_BEING_DELETED_CACHE      |    32 |
|       49 | test/FTS_000000000000002e_CONFIG                   |    33 |
|       50 | test/FTS_000000000000002e_DELETED                  |    34 |
|       51 | test/FTS_000000000000002e_DELETED_CACHE            |    35 |
|       46 | test/opening_lines                                 |    30 |
+----------+----------------------------------------------------+-------+
12 rows in set (0.00 sec)

mysql> SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='test';
+---------------+
| TABLE_NAME    |
+---------------+
| opening_lines |
+---------------+
1 rows in set (0.00 sec)

As far as I know, there's no way to query the FTS tables directly at all. They are only for InnoDB's internal implementation of fulltext indexing.

Can I use InnoDB full text search to tokenize documents?

1 Answers1