4

I am writing a program to generate a massive amount of data (~1 billion records spread across ~20 tables) and populate the tables in SQL Server. The data spans multiple tables with potentially multiple foreign key constraints, as well as several 'enum'-like tables whose values need to be distributed seemingly at random and which are referenced often from the other tables. This leads to a lot of ORDER BY NEWID()-type code, which seems slow to me.
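
The kind of code I mean looks roughly like this (table and column names are just placeholders, not my real schema):

-- For every generated row, pick a random value from a small enum table
-- by re-sorting that enum table with NEWID().
INSERT INTO dbo.BigTable (SomeValue, StatusId)
SELECT
    n.SomeValue,
    s.StatusId
FROM dbo.GeneratedNumbers AS n
CROSS APPLY (
    SELECT TOP (1) e.StatusId
    FROM dbo.StatusEnum AS e
    WHERE n.Id = n.Id          -- reference the outer row so a new random pick happens per row
    ORDER BY NEWID()
) AS s;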

My question is: which strategy would be more performant:

  1. Generate and insert the data in SQL Server, using set-based operations and a lot of ORDER BY NEWID() to get randomness

  2. Generate all the data on the client (which should make operations like choosing a random value from an enum table much faster), then import the data into SQL Server

I can see positives and negatives in both strategies. Generating the random data would obviously be easier and probably more performant on the client; however, getting that data over to the server would be slow. Beyond that, importing the pre-generated data and inserting it with a set-based operation should be similar in scale.

Has anyone done something similar?

Andrew
  • Try both and see which is more performant. Seems straightforward. – dfundako Mar 27 '18 at 20:28
  • Define massive. – paparazzo Mar 27 '18 at 20:38
  • @paparazzo ~1 billion records overall, spread across multiple tables – Andrew Mar 27 '18 at 20:40
  • Explain the need for ORDER BY NEWID(). Brave soul you are attempting to ORDER BY a set with a billion rows. Far better to avoid any sorting on a billion rows. – Jim Horn Mar 27 '18 at 20:41
  • @JimHorn I need the data to be distributed randomly, i.e. records should have fields with different values from a predefined enum. With set based operations, it seems like the fastest way to choose a random value in a small enum-like table. – Andrew Mar 27 '18 at 20:42
  • Do you need all rows distributed randomly? If yes you're somewhat stuck with ORDER BY NEWID(), otherwise if you only need a subset then another option is to use an identity field, then generate a #temp table of random numbers from 1 to max(identity field), then SELECT from a billion rows JOIN #temp on... That way you're only spending time on the rows you need and not the entire set. – Jim Horn Mar 27 '18 at 20:47
  • @JimHorn That could be a potential improvement. Note that I am not ORDER BY'ing the entire billion rows, but in each row I am using an ORDER BY NEWID() on another table to generate a random value in that row. Also, I am mainly wondering if this cost would be better suited on the client, given the amount of work done on a per-row basis. – Andrew Mar 27 '18 at 20:54

2 Answers

4

ORDER BY NEWID(), as other members have said, can be an extremely expensive operation. There are other, faster ways of getting random data in SQL Server:

SELECT * FROM StackOverflow.dbo.Users TABLESAMPLE (.01 PERCENT);

or

DECLARE @row bigint = (
    -- pick a random row number based on the table's row count in sys.partitions
    SELECT RAND(CHECKSUM(NEWID())) * SUM([rows])
    FROM sys.partitions
    WHERE index_id IN (0, 1) AND [object_id] = OBJECT_ID('dbo.thetable'));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

Credits to Brent Ozar and his recent blog post: https://www.brentozar.com/archive/2018/03/get-random-row-large-table/
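
Applied to set-based generation, the same RAND(CHECKSUM(NEWID())) idea can pick a random enum value per generated row without sorting anything. A rough sketch with placeholder names, assuming the enum table has contiguous IDs 1..N:

-- Random enum ID per row, no ORDER BY NEWID() and no per-row sort.
DECLARE @enumCount int = (SELECT COUNT(*) FROM dbo.StatusEnum);

INSERT INTO dbo.BigTable (SomeValue, StatusId)
SELECT
    n.SomeValue,
    1 + CAST(RAND(CHECKSUM(NEWID())) * @enumCount AS int)   -- random ID in 1..@enumCount
FROM dbo.GeneratedNumbers AS n;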

I would choose to generate the massive data volumes on the RDBMS side.

Alexander Volok
0

You don't need to create billions of NEWID values.

Create one table with a million random values and reference it a number of times. If your randomness repeats every million rows, I suspect that would be OK.

Pick a random starting point and an increment, and use % on the increment to cycle through the table.

If you need values 0 - n, again use a % (see the sketch below).
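
A rough sketch of that idea, with placeholder table and column names:

-- One-time setup: a pool of one million precomputed random values.
CREATE TABLE dbo.RandomPool (Slot int PRIMARY KEY, RandomValue int);

INSERT INTO dbo.RandomPool (Slot, RandomValue)
SELECT TOP (1000000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1,   -- slots 0..999999
    ABS(CHECKSUM(NEWID()))
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

-- During generation: random starting point plus an increment, cycled with %.
DECLARE @start int = ABS(CHECKSUM(NEWID())) % 1000000;
DECLARE @step  int = 7919;   -- any step coprime with the pool size visits every slot

SELECT
    n.SomeValue,
    r.RandomValue % 50 AS EnumId       -- % again to map into 0..49 (50 is a placeholder)
FROM dbo.GeneratedNumbers AS n
JOIN dbo.RandomPool AS r
    ON r.Slot = (@start + CAST(n.Id AS bigint) * @step) % 1000000;   -- bigint to avoid overflow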

paparazzo