
I have SQL Server installed on Ubuntu with a single table (accessed via sqlcmd) that has nearly 2 trillion rows. It currently has a clustered index on column A, which is not useful for me.

I would like to run a select statement on the data:

select b, c, d, e, f 
from table 
where b = 'a vehicle number'

and write the results to a .csv file. But this query takes around 20 minutes to run, even after creating a nonclustered index on column b.
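For reference, a single export of this kind can be scripted from the shell with bcp, which ships in the same command-line tools package as sqlcmd. A minimal sketch; the database name (mydb), table name (dbo.mytable), vehicle number, and login are all placeholders:

# -c = character mode, -t, = comma field separator, -S/-U/-P = server and login
bcp "select b, c, d, e, f from mydb.dbo.mytable where b = 'VEH001'" \
    queryout vehicle_VEH001.csv -c -t, -S localhost -U sa -P 'YourPassword'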

I have to run the same query for 400 different vehicle numbers, and it will take days to execute.

How can I make this process faster? Can I run two select statements (for two different vehicles) simultaneously in some way, so that both complete within 20 minutes total (not 20 + 20 = 40)?

I have a powerful PC with 128 GB RAM and 40 cores (Intel Xeon 4210 @ 2.2 GHz).
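One way to attack the parallelism question is at the client: nothing stops several bcp (or sqlcmd) sessions from running at once, so the 400 exports can be fanned out across cores from the shell. A sketch, assuming a file vehicles.txt with one vehicle number per line and the same placeholder names as above:

# run 8 export sessions at a time; xargs substitutes each vehicle number for {}
xargs -a vehicles.txt -P 8 -I{} \
    bcp "select b, c, d, e, f from mydb.dbo.mytable where b = '{}'" \
        queryout "vehicle_{}.csv" -c -t, -S localhost -U sa -P 'YourPassword'

Whether this scales anywhere near 8x depends on where the 20 minutes is going: concurrent sessions share the same disk, so an I/O-bound scan may not speed up much, while indexed seeks should overlap well.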

marc_s
Hold My Stack
  • `select b,c,d,e,f from table where b = 'a vehicle number'` -- should be extremely fast (a few milliseconds) if you have a UNIQUE index on column `b`. Even if the index is not unique, I doubt there will be many collisions (multiple rows with the same value); it should still be extremely fast. – The Impaler Nov 13 '21 at 13:38
  • How big is the table in MB? How many rows do you expect to get back for each query? Have you tried using `where b in ('vehicle number 1', 'vehicle number 2', etc.) order by b` and doing post-processing to split the csv up (see the single-pass sketch after these comments)? – Andrew Sayer Nov 13 '21 at 13:41
  • Hi, thanks for replying. The values in b are not unique; there will be around a few million rows with the same b. @TheImpaler – Hold My Stack Nov 13 '21 at 13:41
  • 2
    Whether the index is used depends largely on the data distribution and how accurate the statistics are, eg if SQL Server thinks 'a vehicle number' might be 75% of the table it might choose to scan the table anyway. You might find it a lot quicker to create an index on `b` that's *covering* then drop it when done. – Stu Nov 13 '21 at 13:42
  • @AndrewSayer, the table is around 300 GB (it has data for 1200 vehicles, of which only 400 are required by me). I haven't tried the solution that you suggested, but won't it generate one large csv file, which is again difficult to split? The idea was to generate 400 different CSV files. – Hold My Stack Nov 13 '21 at 13:44
  • @Stu, I tried creating an index like this: `create index (index name) on Table (b) include (c,d,e,f);` but this exhausted the database space (150 GB free). Indexing is taking three hours just for column b alone (without the include). – Hold My Stack Nov 13 '21 at 13:53
  • 1
    @HoldMyStack sounds like your server is under-resourced for its purpose then, you should have *at least* 1.5x additional space for operations such as this; I have a 2TB usb drive kicking around somewhere, your database would fit on my phone! – Stu Nov 13 '21 at 13:57
  • @Stu I have another doubt. I want to select 5 columns (b,c,d,e,f) using the select statement. So if I create an index (on b include (c, d, e, f)), won't it take the same time as retrieving the data from these 5 columns directly? For the indexing, the server has to go through these 5 columns anyway, and in that time I could just retrieve the data, right? (I'll have to retrieve it only once.) – Hold My Stack Nov 13 '21 at 14:11
  • No, because once the index is created on `b`, each query you execute that filters on `b` and selects those additional columns will seek directly to the rows for `b` and scan just the rows required in the index; currently it's probably scanning the clustered index in its entirety each time. – Stu Nov 13 '21 at 14:35
  • @Stu Sorry, what I meant is: since I have to spend time indexing 5 columns (a covering index) in order to retrieve the same 5 columns later, wouldn't it be better to retrieve them directly instead of indexing them first? If I am indexing the 5 columns, the database has to go through them at least once anyway, so instead of indexing while visiting these columns, I could choose to retrieve them, right? – Hold My Stack Nov 13 '21 at 14:56
  • If we assume an even distribution of vehicles in your table, then you're expecting to grab 400/1200 = 1/3 of the table, about 100 GB. This is probably a case where a scan is going to be quickest. I would stick with my previous suggestion; look into Linux methods of splitting a sorted csv file (sorted makes this easier, as it just needs to read through the file and check for when the value changes). Otherwise this is the sort of scale where hash partitioning by vehicle might make sense (and full scanning just one partition per query), but this is a big change. – Andrew Sayer Nov 13 '21 at 18:22
  • What about changing the clustering to column `b`? – Charlieface Nov 14 '21 at 23:18
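A sketch of the covering index Stu describes above; the index and table names are placeholders, and as noted it needs enough free space for what amounts to a second copy of those five columns:

-- seek on b, then read c, d, e, f straight from the index leaf, never touching the base table
create nonclustered index ix_mytable_b_covering
    on dbo.mytable (b)
    include (c, d, e, f);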
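And a sketch of Andrew Sayer's single-pass idea: one ordered query over all 400 vehicles, split afterwards. The vehicle numbers shown are placeholders:

-- one pass over the data; 'order by b' groups each vehicle's rows together in the output
select b, c, d, e, f
from dbo.mytable
where b in ('VEH001', 'VEH002', /* ...the remaining numbers... */ 'VEH400')
order by b;

Because the output is sorted by b, splitting it into per-vehicle files is a single sequential read, e.g. with something like `awk -F, '{ print > ($1 ".csv") }' all_vehicles.csv` on Linux.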

1 Answer


Create a table to store the vehicle numbers and join it to the other table; that way a single query will retrieve all the rows you are interested in, and you can enumerate the result.

select t1.b, t1.c, t1.d, t1.e, t1.f 
from table1 as t1
inner join table2 as t2 on t1.b = t2.b
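For completeness, creating and filling that lookup table might look like the following; the table name and the varchar(50) type are assumptions:

-- hypothetical lookup table holding the 400 wanted vehicle numbers
create table dbo.wanted_vehicles (b varchar(50) not null primary key);

insert into dbo.wanted_vehicles (b)
values ('VEH001'), ('VEH002') /* ...remaining vehicle numbers... */;

The primary key on wanted_vehicles.b also gives the optimizer a unique, indexed join key.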

You might also try making the primary key/clustered index a composite of a and b. With that many rows it might not be practical, but it may be worth a shot.
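A rebuild sketch for that experiment; it assumes the existing clustered index is named cix_mytable and is not backing a PRIMARY KEY constraint (otherwise the constraint must be dropped and re-created instead). Note that only the leading key column supports seeks, so for the `where b = ...` filter, b would have to come first:

-- rebuilds the clustered index in place with a new key; a huge operation at this row count
create clustered index cix_mytable
    on dbo.mytable (b, a)
    with (drop_existing = on);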

Also, it seems like something might be wrong; use DBCC CHECKDB to check integrity, and periodically rebuild the indexes.
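A minimal form of those maintenance commands, with the database and table names as placeholders:

-- integrity check for the whole database
dbcc checkdb ('mydb');

-- rebuild all indexes on the table (expensive at this size; schedule accordingly)
alter index all on dbo.mytable rebuild;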

I take it they don't provide SQL Server Management Studio with the Linux version? If they do, use it to display the query cost; it will break the query down into each step and give you values for the computational and I/O costs of each.
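SSMS itself is Windows-only, but much of the same cost information is available from sqlcmd through session settings, and Azure Data Studio (which runs on Linux) can display query plans. A sketch, reusing the placeholder names from above:

-- per-statement CPU/elapsed time and logical/physical reads
set statistics time on;
set statistics io on;
select b, c, d, e, f from dbo.mytable where b = 'VEH001';
go

-- or the estimated plan with per-operator costs (must be the only statement in its batch)
set showplan_all on;
go
select b, c, d, e, f from dbo.mytable where b = 'VEH001';
go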

Mark McGinty