
I've just discovered the TABLESAMPLE clause, but surprisingly it doesn't return the number of rows I've specified.

The table I used has ~14M rows, and I wanted an arbitrary sample of 10000 rows.

select * from tabData TABLESAMPLE(10000 ROWS)

I don't get 10000 rows but a different number every time I execute it (between 8000 and 14000).

What's going on here? Have I misunderstood the intended purpose of TABLESAMPLE?

Edit:

David's link explains it pretty well.

This always returns 10000 roughly-random rows in an efficient way:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

and the REPEATABLE option helps to always get the same sample (unless the data has changed):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

Since I wanted to know whether it's more expensive to use TABLESAMPLE with a large row number to (more or less) guarantee the exact row count, I measured it:
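A minimal sketch of such a timing loop (the WHILE/DATEDIFF harness below is illustrative, not necessarily the exact script used):

DECLARE @start datetime2 = SYSDATETIME();
DECLARE @i int = 1;
WHILE @i <= 20   -- run the sampling query 20 times
BEGIN
    SELECT TOP 10000 * FROM tabData TABLESAMPLE (10000 ROWS);
    SET @i += 1;
END;
-- elapsed wall-clock time for the whole batch, in seconds
SELECT DATEDIFF(second, @start, SYSDATETIME()) AS elapsed_seconds;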

1st loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);

(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch (TABLESAMPLE of only 10000 rows) completed in 14 seconds!

2nd loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);

(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch (TABLESAMPLE of 10000000 rows) completed in 13 seconds!

3rd loop: cross-check with 100% random rows using ORDER BY NEWID():

select TOP 10000 * from tabData ORDER BY NEWID();

(10000 row(s) affected)

Cancelled after one execution that lasted 23 minutes.

Conclusion:

So, surprisingly, the approach with an exact TOP clause and a large number in TABLESAMPLE is not slower. Hence it's a very efficient alternative to ORDER BY NEWID(), provided it doesn't matter that the rows are random at the page level rather than per row (each 8K page of the table is given a random value, and all rows from a selected page are returned).
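To see the page-level clustering for yourself, here is a sketch (note that %%physloc%% and sys.fn_PhysLocFormatter are undocumented, though available since SQL Server 2008):

-- sampled rows arrive in page-sized clumps: many rows share the same file:page
SELECT TOP 100 sys.fn_PhysLocFormatter(%%physloc%%) AS file_page_slot, *
FROM tabData TABLESAMPLE (1000 ROWS);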

Tim Schmelter

4 Answers


See the article here. You need to add a TOP clause and/or use the REPEATABLE option to get the number of rows you want.
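For example, a sketch (the 20000-row cushion is arbitrary; it just needs to sit comfortably above the TOP value):

-- oversample at the page level, then trim to an exact count with TOP
SELECT TOP (10000) *
FROM tabData TABLESAMPLE (20000 ROWS) REPEATABLE (100);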

David Brabant
  • The REPEATABLE option does not help to get the 10000 rows; it just ensures that I get the same number every time the query is executed (until the data changes). Anyway, the article explains it very well. I can combine it with a `TOP` clause (e.g. 10000) to get nearly 10000 rows. Thanks. – Tim Schmelter May 23 '12 at 18:53
  • Nor does TOP if the result is *less* than the TABLESAMPLE number. – Aaron Bertrand May 23 '12 at 18:54
  • @David: So actually you can get the exact number with TABLESAMPLE: `select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);`. Edited my answer accordingly. – Tim Schmelter May 23 '12 at 19:38
  • The `top` trick doesn't help when doing `tablesample (10 percent)`. – ca9163d9 Nov 26 '13 at 22:17
  • But is there a safe way to calculate the minimum number `x` to pass to `TABLESAMPLE(x ROWS)` so that it returns at least either the whole table or `n` rows (if I plan to do `TOP(n)` with a variable `n`)? Or is probability/math going to say that, with any number, there is still a chance that `TABLESAMPLE(x ROWS)` will return fewer than I need? – binki Jun 18 '14 at 18:55

From the documentation:

The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.

http://msdn.microsoft.com/en-us/library/ms189108(v=sql.90).aspx
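In other words, a request like the following sketch can come back empty, because whole pages are sampled rather than individual rows (tabData is the questioner's table):

-- with such a small target, the sampler may pick zero pages
SELECT * FROM tabData TABLESAMPLE (5 ROWS);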

paparazzo
  • Thanks, but [David's link](http://www.mssqltips.com/sqlservertip/1308/retrieving-random-data-from-sql-server-with-tablesample/) explains it better than MSDN. – Tim Schmelter May 23 '12 at 19:10

This behavior has been documented before. There is a good writeup on it here.

I believe you can fix it by passing REPEATABLE with the same seed each time. Here is a snippet from the writeup:

...you will notice that different numbers of rows are returned every time. Without any data changing, re-running the identical query keeps giving different results. This is the non-deterministic factor of the TABLESAMPLE clause. If the table is static and rows have not changed, what could be the reason for returning different numbers of rows on each execution? The factor is that 10 PERCENT is not a percentage of the table's rows or records; it is a percentage of the table's data pages. Once the sample pages are selected, all the rows from the selected pages are returned; the number of rows sampled from each page is not limited. The fill factor of the pages varies depending on the table's data, which makes the script return a different row count every time it is executed. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. When REPEATABLE is specified with a different repeat_seed value, SQL Server will typically return a different sample of the rows in the table.
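As a sketch of the REPEATABLE behavior described above (the seeds 42 and 43 are arbitrary):

-- running this twice returns the identical sample while the data is unchanged
SELECT * FROM tabData TABLESAMPLE (10 PERCENT) REPEATABLE (42);
-- a different seed typically yields a different sample
SELECT * FROM tabData TABLESAMPLE (10 PERCENT) REPEATABLE (43);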

Abe Miessler
  • Thanks, but David's linked article explains it better. The REPEATABLE option does not help to get the 10000 rows; it ensures getting the same sample every time as long as the data has not changed. – Tim Schmelter May 23 '12 at 18:56

I've observed the same.

The page explanation definitely makes sense and rings a bell: you should see much more predictable row counts when your row size is fixed. Try it on a table with no nullable or variable-length columns.

In fact I just used it to prove a theory about using it to update (you were probably spurred by the same question I was), and choosing TABLESAMPLE (50000 ROWS) actually affected 49,849 rows.
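A sketch of such a fixed-row-size test (dbo.tabFixed is a hypothetical table): with no NULLable or variable-length columns, every page holds the same number of rows, so the sampled count should land much closer to the target.

CREATE TABLE dbo.tabFixed
(
    id   int      NOT NULL,
    code char(10) NOT NULL,  -- char, not varchar: fixed width
    val  int      NOT NULL
);
-- ...populate with a few hundred thousand rows, then:
SELECT COUNT(*) AS sampled FROM dbo.tabFixed TABLESAMPLE (50000 ROWS);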

Aaron Bertrand
  • So we cannot rely on `TABLESAMPLE` if the number of rows plays an important part? – Tim Schmelter May 23 '12 at 19:02
  • @TimSchmelter I can't see how, no. If the number of rows matters, use TOP, a subquery with ROW_NUMBER(), or in SQL Server 2012 OFFSET / FETCH (see the sketch after this thread). The expensive part is going to be the sort if the sort needs to be randomized. – Aaron Bertrand May 23 '12 at 19:03
  • Hmm, skipped that part: David's link suggests combining the exact `TOP` with a high row number in `TABLESAMPLE`. It works (see my edit)! Do you think that a large number makes TABLESAMPLE more expensive then? – Tim Schmelter May 23 '12 at 19:31
  • @TimSchmelter I don't know, you'd have to test it, but I suspect it will be more expensive, yes. Personally I don't really trust it no matter how big you make the sample, so if the number of rows is important, I'd rather use something predictable (especially if I don't know that I'm gaining any performance from TABLESAMPLE in the first place). – Aaron Bertrand May 23 '12 at 19:39
  • Surprisingly there's no performance difference (see my edited question), even if I change the order of execution. – Tim Schmelter May 23 '12 at 19:57
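A sketch of the predictable alternatives mentioned in this thread (both return exactly 10000 rows, at the cost of the full NEWID() sort that TABLESAMPLE avoids):

-- SQL Server 2012+: OFFSET / FETCH
SELECT *
FROM tabData
ORDER BY NEWID()
OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY;

-- ROW_NUMBER() subquery variant for earlier versions
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM tabData
) AS t
WHERE rn <= 10000;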