
I am looking for a way to store a large amount of data in a file or files. The additional requirement is that the data should be indexed: two indexes on integer fields should allow selecting a specific set of records very quickly.

Details: the data record is a fixed-length set of 3 integers like this:

A (int) | B (int) | N (int)

A and B are indexable columns while N is just a data value.

This data set may contain an enormous number of records (30M, for example, and potentially billions), and there should be a way to select all records with a given A value as fast as possible, or all records with a given B value as fast as possible.

I cannot use any technologies other than MySQL and PHP, and you may say: "Well, just use MySQL!" Sure, I am already using it, but because of MySQL's extra per-row data my database takes about 10 times more space than it should, plus the index data on top of that.

So I am looking for a file-based solution.

Are there any ready-made algorithms for implementing this, or an existing source code solution?

Thank you!

Update 1:

CREATE TABLE `w_vectors` (
    `wid` int(11) NOT NULL,
    `did` int(11) NOT NULL,
    `wn` int(11) NOT NULL DEFAULT '0',
    UNIQUE KEY `did_wn` (`did`,`wn`),
    KEY `wid` (`wid`),
    KEY `did` (`did`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci

Update 2:

The goal of this table is to store document-vs-word vectors for a word-based search application. The table stores all the words from all the documents in compact form (wid is the word ID from the word vocabulary, did is the document ID, and wn is the word's number within the document). This works pretty well; however, if you have, say, 1,000,000 documents and each document contains an average of 10k words, the table becomes VERY huge: about 10 billion rows. And with a row size of roughly 34 bytes, that is a 340 GB structure for just 1 million documents... not good, right?
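To illustrate, the typical lookups against this table look roughly like this (placeholder values; the real queries are of course parameterized):

-- All documents (and word positions) containing word 123; served by KEY `wid`
SELECT did, wn FROM w_vectors WHERE wid = 123;

-- All words of document 456; served by the (did, wn) key
SELECT wid, wn FROM w_vectors WHERE did = 456;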

I am looking for a way to optimize this.

Epsiloncool
  • Have you tried SQLite? It is essentially a minimalistic, file-based MySQL. That way you would still have the benefit of built-in data indexing and a query language on top. – Dat Pham Feb 11 '21 at 21:57

2 Answers


If you must use MySQL, you could try:

  • Convert the table to MyISAM, which takes less space than InnoDB and allows multiple indexes per table. I rarely recommend MyISAM because it doesn't support ACID properties, but if your alternative is a file-based solution, that won't support ACID either. (Both options are sketched after this list.)

  • Use one of the various solutions for compressed data in MySQL. There's a nice comparison here: https://www.percona.com/blog/2018/11/23/compression-options-in-mysql-part-1/
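For example, both options can be applied with a plain ALTER TABLE against the w_vectors table from Update 1. This is only a sketch: the KEY_BLOCK_SIZE value is just an example, and InnoDB table compression also requires innodb_file_per_table.

-- Option 1: convert to MyISAM (smaller on disk, but no ACID or row-level locking)
ALTER TABLE w_vectors ENGINE=MyISAM;

-- Option 2: keep InnoDB but enable table compression
ALTER TABLE w_vectors ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;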

Bill Karwin

You may as well change

UNIQUE KEY `did_wn` (`did`,`wn`)

to

PRIMARY KEY(did, wn)

and get rid of

INDEX(did)

since that composite index already takes care of queries on did.
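In DDL terms, that change could look roughly like this, in a single ALTER so the table is rebuilt only once:

ALTER TABLE w_vectors
    DROP KEY did_wn,          -- replaced by the primary key
    DROP KEY did,             -- redundant; the PK starts with did
    ADD PRIMARY KEY (did, wn);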

With that PK, these will be very efficient:

... WHERE did = 123
... WHERE did = 123 AND wn = 456
... WHERE wn = 456 AND did = 123

Meanwhile, your INDEX(wid) benefits any WHERE clause that tests for a single value of wid or a range of wids.
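For example (placeholder values):

... WHERE wid = 789
... WHERE wid BETWEEN 100 AND 200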

Since I don't know about your original A and B, I can't answer your question in terms of the real column names. Anyway:

there should be a way to select all records with a given A value as fast as possible, or all records with a given B value as fast as possible.

For those, you need

INDEX(A)  -- or any index _starting with_ A
INDEX(B)  -- or any index _starting with_ B

But if either of those is did, don't add it. (The PK will take care of making it fast.)

Also, use InnoDB, not MyISAM. Alas, that leads to "10 times more space than it should" in your case. If you choose to use MyISAM, I will need to start over on index recommendations.

Once you map A and B to the column names, I'll give you one more tip.

More discussion of indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Rick James
  • Hi, I have changed the indexes as you recommended, and yes, the Index Length decreased significantly (almost by half). However, the Data Length did not change and even increased slightly. The ideal row size for my type of data should be 12 bytes (3 ints), but in reality the row size was about 34 bytes, and after the index modifications it became about 37 bytes. So it's 3x the ideal size. – Epsiloncool Feb 15 '21 at 01:34
  • But your index optimization idea is very good. I will definitely implement it if I decide to keep MySQL for this task. However, I still need to try to find the proper solution... – Epsiloncool Feb 15 '21 at 01:36
  • I also updated my question (see Update 2) to help you better understand my needs. – Epsiloncool Feb 15 '21 at 01:47
  • @Epsiloncool - Consider a `FULLTEXT` index -- It will do all the work and be _very_ efficient. By that, I mean much more efficient than your lookup table. – Rick James Feb 15 '21 at 07:21
  • The solution we are building is actually a replacement for MySQL's FULLTEXT index. FULLTEXT is not flexible enough and has limits that we are trying to overcome with this solution. – Epsiloncool Feb 15 '21 at 11:24