
I am working in the spatialite-gui with a sqlite db. I have a large table (80,000,000 records) and I want to make a table that includes a code from another table, which has about 48,000 records.

I wrote the following code and tested it on the first 80k records of my big table; it took about 1:25 to run. When I ran it on the big table it went for 100 hours and I ended up cancelling with no result.

What can I do to improve query performance? I have an index on longitude but not on latitude; neither field is unique. Both lut.pointgeom and lut.gridgeom are BLOB fields and not strictly necessary.

CREATE TABLE policy_gcode AS
SELECT 
  p.*,
  lut.gcode,
  lut.pointgeom,
  lut.gridgeom
FROM
  allpol AS p
  JOIN policylutgrid AS lut 
    ON  p.latitude = lut.latitude 
    AND p.longitude = lut.longitude;
GMB
EnKay

2 Answers


First of all, you want a multicolumn SQLite index on each table covering both the latitude and longitude fields, ordered the same way you join on them. This should allow the database engine to optimize the JOIN operation that occurs within your query.

CREATE INDEX allpol_idx ON allpol(latitude, longitude);
CREATE INDEX policylutgrid_idx ON policylutgrid(latitude, longitude);

It is also a good idea to SELECT only the fields that you really need: if you don't need the BLOBs (or any other field), don't ask for them. This will reduce the amount of data that your database must handle.
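As a concrete illustration, here is a minimal runnable sketch using Python's built-in sqlite3 module. The toy tables stand in for the real ones (the policy_id column and the coordinate values are invented for the example); it builds the two multicolumn indexes and runs a trimmed query that skips the BLOB columns. EXPLAIN QUERY PLAN should then show a SEARCH ... USING INDEX step on the lookup table instead of a full scan:

```python
import sqlite3

# Toy stand-ins for allpol (80M rows) and policylutgrid (48k rows),
# joined on (latitude, longitude). Values are invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE allpol (policy_id INTEGER, latitude REAL, longitude REAL);
CREATE TABLE policylutgrid (latitude REAL, longitude REAL, gcode TEXT);
INSERT INTO allpol VALUES (1, 40.0, -75.0), (2, 40.0, -75.0), (3, 41.0, -76.0);
INSERT INTO policylutgrid VALUES (40.0, -75.0, 'A1'), (41.0, -76.0, 'B2');

-- The multicolumn indexes recommended above.
CREATE INDEX allpol_idx ON allpol(latitude, longitude);
CREATE INDEX policylutgrid_idx ON policylutgrid(latitude, longitude);
""")

# Trimmed query: only the fields actually needed, no BLOB columns.
query = """
CREATE TABLE policy_gcode AS
SELECT p.*, lut.gcode
FROM allpol AS p
JOIN policylutgrid AS lut
  ON p.latitude = lut.latitude AND p.longitude = lut.longitude
"""

# EXPLAIN QUERY PLAN can prefix any statement; with the indexes in
# place the plan probes an index (SEARCH) rather than scanning both tables.
for row in con.execute("EXPLAIN QUERY PLAN " + query):
    print(row)

con.execute(query)
print(con.execute("SELECT COUNT(*) FROM policy_gcode").fetchone()[0])  # 3
```

Dropping the index-creation statements and re-running the EXPLAIN QUERY PLAN makes the difference easy to see on your own data.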

If these measures do not solve the issue, then you have to look at other parts of the picture:

  • allocate more system resources to your database
  • migrate to a more powerful RDBMS than sqlite
GMB
  • I did the index over both columns and it improved the query speed significantly. Someone commented about EXPLAIN QUERY PLAN so I did that too. I also took my friend's suggestion and checked that the query times were not increasing exponentially as the sample size increased. In the end, I was not able to get the query to run on my laptop, I believe I was out of RAM. I moved everything to a more powerful machine and it ran in just under 10 minutes! Thank you. – EnKay Jan 11 '19 at 00:06

Obviously an index on (latitude, longitude) in each of the tables -- if you don't have them -- will help the query (actually, one index is probably sufficient). But the basic problem could be the size of the data.

This will be particularly true if the result set is . . . really big. You can calculate its size by running:

select sum(p.cnt * lut.cnt)
from (select latitude, longitude, count(*) as cnt
      from allpol
      group by latitude, longitude
     ) p join
     (select latitude, longitude, count(*) as cnt
      from policylutgrid
      group by latitude, longitude
     ) lut
     on p.latitude = lut.latitude and p.longitude = lut.longitude;

This is the size of your expected table. If it is really big -- say, more than a few billion rows -- then the size of the result set may be the determining factor. If that is the case, the issue is duplicate lat/long values in the tables. Further investigation might suggest what to do about them.
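Run on toy data with Python's built-in sqlite3 module (the coordinate values are invented), the estimate behaves as described: three policies share a coordinate pair that appears twice in the lookup table, so the join would emit 3 × 2 + 1 × 1 = 7 rows, and that is exactly what the aggregate reports:

```python
import sqlite3

# Toy tables with deliberately duplicated (latitude, longitude) pairs,
# to show how duplicates multiply the size of the join result.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE allpol (latitude REAL, longitude REAL);
CREATE TABLE policylutgrid (latitude REAL, longitude REAL);
INSERT INTO allpol VALUES (40.0, -75.0), (40.0, -75.0), (40.0, -75.0), (41.0, -76.0);
INSERT INTO policylutgrid VALUES (40.0, -75.0), (40.0, -75.0), (41.0, -76.0);
""")

# The size-estimation query from the answer: count rows per coordinate
# pair on each side, then sum the products over matching pairs.
size = con.execute("""
select sum(p.cnt * lut.cnt)
from (select latitude, longitude, count(*) as cnt
      from allpol
      group by latitude, longitude
     ) p join
     (select latitude, longitude, count(*) as cnt
      from policylutgrid
      group by latitude, longitude
     ) lut
     on p.latitude = lut.latitude and p.longitude = lut.longitude
""").fetchone()[0]
print(size)  # 7
```

Because the estimate only touches the grouped coordinate pairs, it runs far faster than the join itself, so it is a cheap sanity check before launching the big CREATE TABLE.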

Gordon Linoff
  • Thank you. I ran that and got an answer that I was expecting, which is good, and not billions of rows, which is also good. I will try the index over both columns and see if it at least runs to the end. Otherwise I will take the other suggestion of moving to a more powerful RDBMS. – EnKay Jan 02 '19 at 18:01