
I have two large (~100 million rows) tables I'm trying to join. I have indices on both columns used in the join. Selecting the first 1000 rows takes several hours, but when it's done, SQLite Studio reports that it only took a minute. Then, it takes another several hours for SQLite Studio to count the rows for my results and if I try to open another query window, it becomes unresponsive for these hours. The entire time, task manager shows around 25% CPU usage and 7-8 MB/s disk usage for the process. I also tried selecting the top 10k rows and it took 11 hours to complete and another 11 hours to get the row count, but reported that the query finished in 4 minutes. Here is the query:

```sql
SELECT d.PRC, s.prccd, abs(abs(d.PRC) - s.prccd), *
FROM dsf d
JOIN secd.secd s
  ON s.datadate = d.DATE AND substr(s.cusip, 1, 8) = d.CUSIP
WHERE abs(abs(d.PRC) - s.prccd) > 0.0006
LIMIT 10000
```
  1. Why is this taking so long? I know 100 million rows is a lot, but with sorted indices, shouldn't the join happen in roughly linear time? Adding the indices took several minutes, not hours, and that should be O(n log n) since it has to sort. I get the same results without using substr().

  2. Why is SQLite Studio reporting that it only takes a minute or two?

  3. Why does SQLite Studio take so long to count the result rows, after the results are already displayed?
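The behavior in question 1 can be reproduced directly with `EXPLAIN QUERY PLAN`. Below is a minimal sketch (toy schemas reusing the question's table and column names, single database rather than the attached `secd` schema, no data) showing that with only single-column indexes, SQLite runs this as a nested loop: a full SCAN of one table plus an index SEARCH per row, not a single merge pass over two sorted indexes:

```python
import sqlite3

# Toy reproduction: table/column names mirror the question, but the
# schemas are simplified and empty, since only the query plan matters.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dsf  (DATE TEXT, CUSIP TEXT, PRC REAL);
    CREATE TABLE secd (datadate TEXT, cusip TEXT, prccd REAL);
    CREATE INDEX dsf_date      ON dsf(DATE);
    CREATE INDEX secd_datadate ON secd(datadate);
""")
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT d.PRC, s.prccd
    FROM dsf d
    JOIN secd s ON s.datadate = d.DATE
               AND substr(s.cusip, 1, 8) = d.CUSIP
    WHERE abs(abs(d.PRC) - s.prccd) > 0.0006
""").fetchall()
for row in plan:
    # The last column is the human-readable plan step:
    # one table is SCANned in full, the other SEARCHed via an index.
    print(row[-1])
```

With ~100 million rows on the scanned side, each outer row triggers an index probe into the other table, which explains hours of runtime despite both join columns being indexed individually.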

EDIT: Output of EXPLAIN QUERY PLAN

```
5   0   0   SCAN TABLE dsf AS d
7   0   0   SEARCH TABLE secd AS s USING INDEX secd_datadate (datadate=?)
```
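Which table gets the SCAN and which gets the SEARCH depends on the planner's statistics. A minimal sketch of refreshing them with `ANALYZE` (which `PRAGMA optimize` also runs when it judges it useful); the table name `t` here is illustrative, not from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # in practice: the real database file
con.executescript("""
    CREATE TABLE t (x INTEGER);
    CREATE INDEX t_x ON t(x);
""")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
con.execute("ANALYZE")  # writes per-index row estimates into sqlite_stat1
# The planner consults these estimates when choosing join order and indexes.
stats = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```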
  • Sounds like a heavy job actually reading data from disk. The query is resolved quickly, but then it takes a long time to read the bytes from disk. At least that's what it looks like. How does it behave if you execute it **and display results** from the official command-line client `sqlite3.exe` (or `sqlite3` for Linux/Mac)? – Googie Jul 10 '20 at 07:41
  • Use [`EXPLAIN QUERY PLAN`](https://www.sqlite.org/eqp.html) to see if it's actually using any indexes. Run [`ANALYZE`](https://www.sqlite.org/lang_analyze.html) if you haven't already. – Shawn Jul 10 '20 at 10:54
  • @Googie What would cause that to happen and how can I fix it? I'm running it in sqlite3.exe and it's been running for 10 minutes and has found one result, so looks like it's still slow. – Kalev Maricq Jul 11 '20 at 05:39
  • @Shawn I ran PRAGMA optimize which should run analyze. It didn't change the query plan. I posted the results of EXPLAIN QUERY PLAN above. – Kalev Maricq Jul 11 '20 at 05:39
  • Yeah, the table scan on `dsf` is killing it along with only one of the columns of `secd.secd` being indexed. I don't think there's a way to avoid the full scan on one of the tables since the `WHERE` uses columns from both tables. Hmm. Might play around with [indexes on expressions](https://www.sqlite.org/expridx.html) to see if you can get something that can be used to narrow down things in `secd.secd`, at least... `CREATE INDEX foo ON secd.secd(datadate, substr(cusip,1,8))` maybe, assuming that's accepted (I haven't tried). – Shawn Jul 11 '20 at 07:29
  • An index on `dsf(date,cusip)` and a re-analyze might make it flip the order of the join, too, and scan `secd.secd` and search `dsf`, which may or may not help. – Shawn Jul 11 '20 at 07:33
  • (Oh, and a `LIMIT` without an `ORDER BY` gives you no guarantee about *which* rows will be returned) – Shawn Jul 11 '20 at 07:34
  • @Shawn Adding that index does make it scan secd instead and use the new index. It's a lot faster, like a few seconds instead of hours. What's the general principle here and how can I fix stuff like this in the future? – Kalev Maricq Jul 11 '20 at 18:31
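The fix from the thread can be sketched as follows. The general principle: the table on the inner side of the nested loop needs one index covering every equality in the `ON` clause, so the other table is scanned once and each row does a cheap index probe. The toy schemas below reuse the question's names (single database rather than the attached `secd` schema) and hold no data, since only the plan matters:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dsf  (DATE TEXT, CUSIP TEXT, PRC REAL);
    CREATE TABLE secd (datadate TEXT, cusip TEXT, prccd REAL);
    -- Composite index on BOTH join keys of the probed table:
    CREATE INDEX dsf_date_cusip ON dsf(DATE, CUSIP);
""")
plan = [row[-1] for row in con.execute("""
    EXPLAIN QUERY PLAN
    SELECT d.PRC, s.prccd
    FROM secd s
    JOIN dsf d
      ON d.DATE = s.datadate AND d.CUSIP = substr(s.cusip, 1, 8)
    WHERE abs(abs(d.PRC) - s.prccd) > 0.0006
""")]
for detail in plan:
    # Expect secd to be SCANned once and dsf SEARCHed via dsf_date_cusip.
    print(detail)
```

With only `dsf(DATE, CUSIP)` indexed, the planner scans `secd` and probes `dsf` by both keys, matching the speedup described in the comment above. And as Shawn notes, a `LIMIT` without an `ORDER BY` still gives no guarantee about *which* rows come back.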
