20

I want to make a tags column of type json:

e.g.,

id  |  tags
=========================================
1   |  '["tag1", "tag2", "tag3"]'
2   |  '["tag1", "tag3", "tag5", "tag7"]'
3   |  '["tag2", "tag5"]'

I want to index each tag in the arrays, without knowing the length of the arrays (variable length).

So then if I query for rows that contain tag2, it should return rows 1, 3.

https://dev.mysql.com/doc/refman/5.7/en/json.html

JSON columns cannot be indexed. You can work around this restriction by creating an index on a generated column that extracts a scalar value from the JSON column

By "extracts a scalar value", does this mean I must extract & index each item in the arrays individually (meaning I must know the maximum length of the array to index them all)? How do I index a variable length array?

Community
  • 1
  • 1
Sunny
  • 2,232
  • 7
  • 24
  • 36
  • Nope, a scalar is just a simple value (vs. complex data structures). – Álvaro González Feb 04 '16 at 10:17
  • 2
    If you want to do something with the data in mysql, then don't store it as json. Store it as normalized data, and spit it out as json as and when required. – Strawberry Feb 04 '16 at 10:21
  • Strawberry's comment highlights an important point. For instance, are `'["tag2", "tag5"]'` and `'["tag5", "tag2"]'` the same piece of data or not? – Álvaro González Feb 04 '16 at 10:23
  • Let's say I make a query for `tag2`, then it should match any array containing `tag2`, regardless where it is in the array. – Sunny Feb 04 '16 at 10:25

3 Answers3

14

It's now possible with MySQL 8.0.17+

Something like this (not tested)

CREATE TABLE posts (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tags JSON,
    INDEX tags( (CAST(tags AS VARCHAR(32) ARRAY)) )
    );

Use it this way:

SELECT * FROM posts 
         WHERE JSON_CONTAINS(tags, CAST('[tag1, tag2]' AS JSON));

More details and samples here: https://dev.mysql.com/doc/refman/8.0/en/json.html

hjahan
  • 370
  • 3
  • 13
  • 1
    This is great! Just a note that this is only available with MySQL 8.0.17+ (unfortunately we use AWS RDS, which currently supports to 8.0.16, of course). "Beginning with MySQL 8.0.17, the expr for a key_part specification can take the form (CAST json_path AS type ARRAY) to create a multi-valued index on a JSON column." from: https://dev.mysql.com/doc/refman/8.0/en/create-index.html – Bob Dankert Dec 05 '19 at 17:29
  • 2
    Hi, is it possible to create multi-valued index on array of strings. I tried something like: CREATE INDEX products ON components ( (CAST(data->'$.products' AS VARCHAR(255) ARRAY)) ) but it gives me error. Is there any way to accomplish this? – iceDice Mar 13 '20 at 19:26
  • Yes, it is also possible to create index for that kind of fields. Check out here https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-multi-valued – hjahan Mar 15 '20 at 08:55
  • Can you share some example? I have json with "products" field that contains array of strings which are variable length: ["aa", "bbb", "cccc" ...] – iceDice Mar 15 '20 at 11:55
  • This should works: CREATE INDEX products ON YOURTABLE ( (CAST(YOURJSONFIELD->'$.products' AS UNSIGNED ARRAY)) ); – hjahan Mar 15 '20 at 12:23
  • Nope, i just tried, it throws error "Invalid JSON value for CAST for functional index 'products'" – iceDice Mar 15 '20 at 12:26
  • 2
    `VARCHAR` is not supported, should be `CHAR(32)`, https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_cast – Enix Jun 16 '22 at 02:23
7

By "extracts a scalar value", does this mean I must extract & index each item in the arrays individually [...]?

You can extract as many items as you want. They will be stored as scalars (e.g. string), rather than as compound values (which JSON is).

CREATE TABLE mytags (
    id INT NOT NULL AUTO_INCREMENT,
    tags JSON,
    PRIMARY KEY (id)
);

INSERT INTO mytags (tags) VALUES
    ('["tag1", "tag2", "tag3"]'),
    ('["tag1", "tag3", "tag5", "tag7"]'),
    ('["tag2", "tag5"]');

SELECT * FROM mytags;

+----+----------------------------------+
| id | tags                             |
+----+----------------------------------+
|  1 | ["tag1", "tag2", "tag3"]         |
|  2 | ["tag1", "tag3", "tag5", "tag7"] |
|  3 | ["tag2", "tag5"]                 |
+----+----------------------------------+

Let's create an index with one item only (first value from the JSON object):

ALTER TABLE mytags
    ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]')),
    ADD INDEX tags_index (tags_scalar);

SELECT * FROM mytags;

+----+----------------------------------+-------------+
| id | tags                             | tags_scalar |
+----+----------------------------------+-------------+
|  1 | ["tag1", "tag2", "tag3"]         | "tag1"      |
|  2 | ["tag1", "tag3", "tag5", "tag7"] | "tag1"      |
|  3 | ["tag2", "tag5"]                 | "tag2"      |
+----+----------------------------------+-------------+

Now you have an index on the VARCHAR column tags_scalar. The value contains quotes, which can also be skipped:

ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;

ALTER TABLE mytags
    ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_unquote(json_extract(tags, '$[0]'))),
    ADD INDEX tags_index (tags_scalar);

SELECT * FROM mytags;

+----+----------------------------------+-------------+
| id | tags                             | tags_scalar |
+----+----------------------------------+-------------+
|  1 | ["tag1", "tag2", "tag3"]         | tag1        |
|  2 | ["tag1", "tag3", "tag5", "tag7"] | tag1        |
|  3 | ["tag2", "tag5"]                 | tag2        |
+----+----------------------------------+-------------+

As you can already imagine, the generated column can include more items from the JSON:

ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;

ALTER TABLE mytags
    ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (json_extract(tags, '$[0]', '$[1]', '$[2]')),
    ADD INDEX tags_index (tags_scalar);

SELECT * from mytags;

+----+----------------------------------+--------------------------+
| id | tags                             | tags_scalar              |
+----+----------------------------------+--------------------------+
|  1 | ["tag1", "tag2", "tag3"]         | ["tag1", "tag2", "tag3"] |
|  2 | ["tag1", "tag3", "tag5", "tag7"] | ["tag1", "tag3", "tag5"] |
|  3 | ["tag2", "tag5"]                 | ["tag2", "tag5"]         |
+----+----------------------------------+--------------------------+

or use any other valid expression to auto-generate a string out of the JSON structure, in order to obtain something that can be easily indexed and searched like "tag1tag3tag5tag7".

[...](meaning I must know the maximum length of the array to index them all)?

As explained above, you don't need to know - NULL values can be skipped by using any valid expression. But of course it's always better to know.
Now there's the architecture decision: Is JSON data type the most appropriate to achieve the goal? To solve this particular problem? Is JSON the right tool here? Is it going to speed up searching?

How do I index a variable length array?

If you insist, cast string:

ALTER TABLE mytags DROP COLUMN tags_scalar, DROP INDEX tags_index;

ALTER TABLE mytags
    ADD COLUMN tags_scalar VARCHAR(255) GENERATED ALWAYS AS (replace(replace(replace(cast(tags as char), '"', ''), '[', ''), ']', '')),
    ADD INDEX tags_index (tags_scalar);

SELECT * from mytags;

+----+----------------------------------+------------------------+
| id | tags                             | tags_scalar            |
+----+----------------------------------+------------------------+
|  1 | ["tag1", "tag2", "tag3"]         | tag1, tag2, tag3       |
|  2 | ["tag1", "tag3", "tag5", "tag7"] | tag1, tag3, tag5, tag7 |
|  3 | ["tag2", "tag5"]                 | tag2, tag5             |
+----+----------------------------------+------------------------+

This way or another you end up with a VARCHAR or TEXT column, where you apply the most applicable index structure (some options).

Further reading:

Giacomo1968
  • 25,759
  • 11
  • 71
  • 103
Taz
  • 3,718
  • 2
  • 37
  • 59
7

It's not practical to index an array in JSON in MySQL.

You can use generated columns to extract each element of the array into a separate scalar column, and index each generated column. But how many of these columns will you need? How will you know which column contains the value you're searching for?

You can use a generated column as @bostaf's answer shows, extracting multiple array values and making a comma-separated string. You can't use an plain index to search this string for a word that may occur in the middle. Perhaps you could use a fulltext index, but this only works if the array elements are each a single word.

In April 2018, I gave a presentation about this kind of weakness using JSON in MySQL: How to Use JSON in MySQL Wrong.

The better solution for multi-valued attributes is to store them in a dependent table, in the manner proscribed by database normalization. Then the values appear over multiple rows, in a single column, which you can index in a more straightforward way.


Re your comment:

I came up with a solution for enforcing uniqueness on a JSON array, but it depends on the elements of the array staying in sorted order.

mysql> create table mytags ( tags json );
mysql> insert into mytags values ('["tag1", "tag3", "tag5", "tag7"]');

The JSON_UNQUOTE() function returns the JSON as a string.

mysql> select json_unquote(tags) from mytags;
+----------------------------------+
| json_unquote(tags)               |
+----------------------------------+
| ["tag1", "tag3", "tag5", "tag7"] |
+----------------------------------+

Now we know how to make a generated column based on that, and then a UNIQUE KEY index on the generated column. This works in MySQL 5.7 and later.

mysql> alter table mytags 
  add column j varchar(768) as (json_unquote(tags)), 
  add unique index (j);

Now trying to insert the same array of values in the JSON column fails:

mysql> insert into mytags (tags) values ('["tag1", "tag3", "tag5", "tag7"]');
ERROR 1062 (23000): Duplicate entry '["tag1", "tag3", "tag5", "tag7"]' for key 'j'

Unfortunately, there's no good way to ensure that the JSON array is sorted. See Sorted json array field So it's up to you to design your application code so that it always pre-sorts the values in the JSON array before inserting or updating.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • For the last para of your answer, how can we apply unique constraint across various rows (storing the attributes). Eg: There is a word table (cat, dog, mouse). And there is a phrase table which store unique combination of these words, eg: (cat dog, mouse dog). To ensure that a phrase (combination of words) is unique, we can store this combination of FKs (words) in a json, but then we cannot index this json, in case we want to Join phrase to keywords. Do we have a solution for this kinda problem ? – Madhur Bhaiya Dec 26 '19 at 13:30