-2

I have data in the following format:

+---------+---------+----------+-----------+-----------+-----------+
|    id   |  title  |  author  | keyword_1 | keyword_2 | keyword_3 |     
+---------+---------+----------+-----------+-----------+-----------+

I am looking to store it in a database so I can search by title, keyword_1, keyword_2, or keyword_3.
An example would be

+---------+------------------+-----------+-------------+-------------+-----------+
|    id   |  title           | author    |   keyword_1 | keyword_2   | keyword_3 |     
+---------+------------------+-----------+-------------+-------------+-----------+
|    123  |  Learn Java 101  | John Doe  |   java      | programming | software  |     
+---------+------------------+-----------+-------------+-------------+-----------+

On the front end, there is a form where the user inputs a title and/or keywords. The database needs to be queried for this information. But the user input will likely not be an exact match, so we need to do some kind of regex or fuzzy matching. The user payload may input something like:

{
    title: "Learn Java",
    author: "Jon Doee",
    keyword1: "computers",
    keyword2: "softwar",
    keyword3: null,

}

I realize there are some built-in operations, for example, in Postgres we have LIKE and Levenshtein(). However, I'm not sure if this is the right approach. It seems like a very expensive operation to compare a keyword with all three columns.

Surely there must be a clean way to do this. I am posting here because I want to check whether this is or is not the path that I should go down.

From an architectural standpoint is this the correct way to store the data? I thought about using a document-based system and I'm not sure that that would be much better or worse.

I'm somewhat new to all this and would appreciate some guidance on what is recommended. Thanks!

Jac Frall
  • 403
  • 7
  • 15

2 Answers2

2

I would start with a normalized relational model:

Books:

|    id   |  title           | author    | 
|    123  |  Learn Java 101  | John Doe  |

Then:

BookKeywords

|    book_id   |  Keyword     |
|    123       |  java        |
|    123       |  programming |
|    123       |  software    |

One particularly valuable feature of this data model is that you can have a Keywords table and validate that only valid keywords go into this table.

This is the "normal" way to store multiple values per entity.

After you have mastered this, you can think about alternative structures. For instance:

  • Storing the keywords as a text field and using text search can work well under some circumstances.
  • Storing the keywords as an array can work well under some circumstances.
  • Storing the keywords in JSON can work well under some circumstances.

But start with what the SQL language was designed to support -- separate entities in tables.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • That makes sense. I see one slight issue that I can't think of a clear workaround. How then, would I query the `book_id where keyword="java" and keyword="programming"`, etc. It seems like it would default to an `OR` statement since all the keywords are on separate lines – Jac Frall Aug 19 '20 at 12:29
  • @JacFrall . . . It does default to `or`. You need to use either two comparisons (`where exists (. . . = 'java') and exists (. . . = 'programming')` ) or aggregation. – Gordon Linoff Aug 19 '20 at 14:11
1

When you are using RDBMS and you have clear idea what information you will be storing, why do you prefer storing in document. In RDBMS, one generally uses datatypes like json, xml, etc when information isn't relational or purpose is just storing and retrieving and there are least modification. Looking at your table, relational method will always give you faster results as compared to document method when dealing with huge data.

Yes, like operations are bit expensive and alternate is REGEXP or SIMILAR TO(for Postgres). You should know where to use what. You can always create pattern matching index on the columns which you are going to use in where clause. GIN/GIST index for column where more than 2 words are being stored. ex:Title

If there are continuous updates or delete being performed, consider performing maintenance operations on table by setting correct vacuum parameters, analysing table, index rebuild/recreate.

If there are millions of records being stored, use table partitioning.

Your requirement is pretty decent and I don't see any need of storing in document over here.

Nik
  • 204
  • 1
  • 7
  • 18