The mysqldump tool has an option to apply a WHERE condition to all tables, so you could dump a roughly 1% sample of a MySQL database like this:

mysqldump --where "rand() < 0.01" mydatabase ...

(where mydatabase is the name of your schema and ... stands for any other options you need for the mysqldump command)
However, this takes an independent random sample of rows from each table, with no guarantee that the sampled rows are related. So if the tables have any foreign keys, a given sampled child row has about a 99% chance that its parent row is not included in the dump file.
Also, the condition is applied to every table. You might have exceptions, such as a few tables you want to dump in full. In that case you might have to run mysqldump several times, with different specific lists of tables and different --where conditions, as in the sketch below.
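For example, a minimal sketch of that approach (the table names customers, orders, and country_codes are hypothetical, and connection options are omitted):

# sample ~1% of the big tables
mysqldump --where "rand() < 0.01" mydatabase customers orders > sample.sql

# dump small lookup tables in full and append to the same file
mysqldump mydatabase country_codes >> sample.sql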
Also, no anonymization is applied; any sensitive data is dumped in clear text.
Ultimately I agree with the comment above from tadman — use a tool to generate synthetic data. Then you can generate as much as you want per table whenever you want, and you are guaranteed none of it is sensitive data.
Re your comment above:
You can match the customer's schema and tables by dumping with mysqldump --no-data ... and restoring that to a test instance. That part can be done in a minute.
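For example (a minimal sketch; the host and database names are hypothetical, and connection options are omitted):

# dump table definitions only, no rows
mysqldump --no-data mydatabase > schema.sql

# create an empty database on the test instance and restore the schema into it
mysqladmin -h test-host create testdatabase
mysql -h test-host testdatabase < schema.sql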
Then fill the tables with synthetic data. To your point, yes, it probably does take days to do this for a given schema if you need the data to match the general format of the customer's values, data frequency, and so on. It depends on how accurate you need the test to be.
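As a rough illustration of the idea (not a recommendation of any specific tool), you can generate simple synthetic rows with plain SQL through the mysql client. The table and column names here are hypothetical, and this only produces uniform placeholder values rather than realistic distributions:

mysql -h test-host testdatabase <<'SQL'
-- allow the recursive CTE below to generate more than the default 1000 rows
SET SESSION cte_max_recursion_depth = 100000;
-- insert 10000 fake customers with predictable, non-sensitive values
INSERT INTO customers (name, email)
SELECT CONCAT('user', n), CONCAT('user', n, '@example.com')
FROM (
  WITH RECURSIVE seq (n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 10000
  )
  SELECT n FROM seq
) AS numbers;
SQL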
I asked a question years ago about synthetic data generation tools: Tools for Generating Mock Data? It was long enough ago that some of the tools mentioned may be out of date, and asking for recommendations of tools or technology is discouraged on Stack Overflow these days. Still, if you read the answers on that linked thread, it might give you some names to search for.