-1

I have a large SQLite database (~134 GB) that has multiple tables each with 14 columns, about 330 million records, and 4 indexes. The only operation used on the database is "Select *" as I need all the columns(No inserts or updates). When I query the database, the response time is slow when the result set is big (takes 160 seconds for getting ~18,000 records).

I have improved the use of indexes multiple times and this is the fastest response time I got.

I am running the database as a back-end database for a web application on a server with 32 GB of RAM.

is there a way to use RAM (or anything else) to speed up the query process?

Here is the code that performs the query.

async.each(proteins,function(item, callback) {

   `PI[item] = [];                      // Stores interaction proteins for all query proteins
    PS[item] = [];                      // Stores scores for all interaction proteins
    PIS[item] = [];                     // Stores interaction sites for all interaction proteins
    var sites = {};                     // a temporarily holder for interaction sites

var query_string = 'SELECT * FROM ' + organism + PIPE_output_table +
        ' WHERE ' + score_type + ' > ' + cutoff['range'] + ' AND (protein_A = "' + item + '" OR protein_B = "' + item '") ORDER BY PIPE_score DESC';

db.each(query_string, function (err, row) {

        if (row.protein_A == item) {
            PI[item].push(row.protein_B);

            // add 1 to interaction sites to represent sites starting from 1 not from 0
            sites['S1AS'] = row.site1_A_start + 1;
            sites['S1AE'] = row.site1_A_end + 1;
            sites['S1BS'] = row.site1_B_start + 1;
            sites['S1BE'] = row.site1_B_end + 1;

            sites['S2AS'] = row.site2_A_start + 1;
            sites['S2AE'] = row.site2_A_end + 1;
            sites['S2BS'] = row.site2_B_start + 1;
            sites['S2BE'] = row.site2_B_end + 1;

            sites['S3AS'] = row.site3_A_start + 1;
            sites['S3AE'] = row.site3_A_end + 1;
            sites['S3BS'] = row.site3_B_start + 1;
            sites['S3BE'] = row.site3_B_end + 1;

            PIS[item].push(sites);
            sites = {};
        }
}
  • You may be dropping a lot of time just sending 18K records over the network. Have you done any benchmarking on the query itself? – Tim Biegeleisen Dec 10 '15 at 08:05
  • Could you please show a sample of the table schema, and a full code example that takes the 160 seconds? Not just a query, but the code to fetch the results. I suspect the records are very large and you're trying to fetch them all at once causing network delays and slow memory swapping. – Schwern Dec 10 '15 at 08:46
  • Indexes don't help when you read everything. But what do you need *all* the data for? – CL. Dec 10 '15 at 08:53
  • @TimBiegeleisen Yep, the time (160 seconds) reflects the actual query time excluding any other factors (i.e network transfer). – Haitham Fallatah Dec 10 '15 at 21:39
  • @CL. the tables contain protein Data. I need all the information because I need to display all data related to the protein the user selected in a web page. However, only two columns are used in the 'WHERE' clause and hence theses two columns are the only columns with indexes on them. – Haitham Fallatah Dec 10 '15 at 21:45
  • 1
    @Schwern I have updated my post with the code – Haitham Fallatah Dec 10 '15 at 22:04
  • 1
    There's no way to answer this question without having: 1. The index-defining query included as well (think of all the queries needed to re-establish your indices), 2. The results of `ANALYZE` on the table and each of the indices used in the query, and 3. The results of `EXPLAIN` on the select query. – Kuba hasn't forgotten Monica Dec 10 '15 at 22:09
  • In the `sqlite3` command-line shell, the database schema can be shown with the `.schema` command. In SQLite, use not `EXPLAIN` but `EXPLAIN QUERY PLAN`. – CL. Dec 10 '15 at 22:16
  • This looks like Javascript. What SQL library are you using? – Schwern Dec 10 '15 at 22:24
  • @Schwern : Yep you are right, it is node-sqlite3. – Haitham Fallatah Dec 16 '15 at 03:37
  • @CL. Thanks for the tip about EXPLAIN QUERY PLAN, I was able to solve the problem using it. – Haitham Fallatah Dec 16 '15 at 03:38
  • @KubaOber: Thanks for your help. I was able to find a solution using the tips you provided. – Haitham Fallatah Dec 16 '15 at 03:39

1 Answers1

0

The query you posted uses no variables.

It will always return the same thing: all the rows with a null score whose protein column is equal to its protein_a or protein_b column. You're then having to filter all those extra rows in Javascript, fetching a lot more rows than you need to.

Here's why...


If I'm understanding this query correctly, you have WHERE Score > [Score]. I've never encountered this syntax before, so I looked it up.

[keyword] A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility.

An identifier is something like a column or table name, not a variable.

This means that this...

SELECT * FROM [TABLE]
WHERE Score > [Score] AND
      (protein_A = [Protein] OR protein_B = [Protein])
ORDER BY [Score] DESC;

Is the same as this...

SELECT * FROM `TABLE`
WHERE Score > Score AND
      (protein_A = Protein OR protein_B = Protein)
ORDER BY Score DESC;

You never pass any variables to the query. It will always return the same thing.

This can be seen here when you run it.

db.each(query_string, function (err, row) {

Since you're checking that each protein is equal to itself (or something very like itself), you're likely fetching every row. And it's why you have to filter all the rows again. And that is one of the reasons why your query is so slow.

    if (row.protein_A == item) {

BUT! WHERE Score > [Score] will never be true, a thing cannot be greater than itself except for null! Trinary logic is weird. So only if Score is null can that be true.

So you're returning all the rows whose score is null and the protein column is equal to protein_a or protein_b. This is a lot more rows than you need, I guess you have a lot of rows with null scores.


Your query should incorporate variables (I'm assuming you're using node-sqlite3) and pass in their values when you execute the query.

var query = "                                              \
    SELECT * FROM `TABLE`                                  \
    WHERE Score > $score AND                               \
          (protein_A = $protein OR protein_B = $protein)   \
    ORDER BY Score DESC;                                   \
";
var stmt = db.prepare(query);
stmt.each({$score: score, $protein: protein}, function (err, row) {
    PI[item].push(row.protein_B);
    ...
});
Schwern
  • 153,029
  • 25
  • 195
  • 336
  • Sorry for throwing you off with the code. I had actually modified the code with brackets when I posted it here to make it easier to read. (I've updated the code above). The databases doesn't contain any null values because the data is parsed in from a file that is generated by an algorithm which doesn't produce null values. regardless, I have been able to identify where the problem was after a lot of testing. it was because I am pulling all columns (21) and the indexes only cover two of them. I had to modify the design of the database to have much fewer columns and have a covering index. – Haitham Fallatah Dec 16 '15 at 03:34
  • Thank you so much for you help. and sorry for not being able to reply sooner. – Haitham Fallatah Dec 16 '15 at 03:35
  • @HaithamFallatah Well, it was a fascinating answer to the wrong question. :) And your code is vulnerable to a [SQL Injection Attack](https://en.wikipedia.org/wiki/SQL_injection). Please use [placeholders in a prepared statement](https://en.wikipedia.org/wiki/Prepared_statement). Also the `if (row.protein_A == item)` means you're throwing out half the `WHERE protein_A = item OR protein_B = item` rows you're fetching. Check only `WHERE protein_A = item` and drop the `if (row.protein_A == item)`. – Schwern Dec 16 '15 at 04:19
  • there is actually an else statement that deals with the other half and I left it out since it is almost identical to the code above and I thought it is not relevant to the problem. and You're right regarding the SQL injecting, But I actually tried preparing the query and and use place holders (just like you showed above) but it doesn't seem to be recognized for some reason. I have posted a question earlier about this found here [link](http://stackoverflow.com/questions/34303706/binding-variables-to-sql-statements-using-node-sqlite3) – Haitham Fallatah Dec 16 '15 at 05:32
  • The reason why I use the if statement above is because protein_A could either be an interacting protein or a query protein (the one the user gave as input represented as 'item'). so the if statement is saying "if protein_A is the protein the user provided, get protein_B and label it as interacting protein" and the else statement does the opposite. – Haitham Fallatah Dec 16 '15 at 05:43