
A customer wants to know how long data migrations/transitions will take on their database. Their DB is ~5 TB, and even if we could get a copy, it would have to be anonymized (a bad idea).

So, is there a way to scan the client database, examine table size, PK/FK, etc., and create a scale model of that database at a configurable size (1%, 5%, etc.) with populated data that matches the related records of the original DB?

Shadow
KevDog
  • Hint: Look at where I work for why we might be looking for solutions other than a fake data generator. Changing the data topology for each customer simulation would be a nightmare; even setting it up for one would take days. – KevDog Jan 18 '23 at 12:43
    @KevDog I think the idea is to generate fake data that fits the customer's topology. In other words, dump a schema-only version of their tables (`mysqldump --no-data ...`), then restore to a test environment and fill the tables with fake data. But it would match that customer's schema and tables exactly. – Bill Karwin Jan 18 '23 at 13:49
  • I’m voting to close this question because I can't delete it. – KevDog Jan 18 '23 at 13:51

1 Answer


The mysqldump tool has an option to apply a WHERE condition to all tables, so you could dump roughly a 1% sample of a MySQL database like this:

    mysqldump --where "rand() < 0.01" mydatabase ...

(where mydatabase is the name of your schema, and ... stands for any other options you need for the mysqldump command)

However, this takes an independent random sample of rows from each table, so there's no guarantee that related rows are selected together. If the tables have foreign keys, a given child row has only a 1% chance of having its parent row included in the dump file.
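One workaround, sketched below with hypothetical orders and order_items tables, is to sample on a deterministic expression of the key instead of rand(), so the parent and child tables select matching rows:

    # hypothetical tables: orders(id) is the parent of order_items(order_id)
    # keep every 100th parent, and exactly the child rows that reference them
    mysqldump --where "id % 100 = 0" mydatabase orders > orders.sql
    mysqldump --where "order_id % 100 = 0" mydatabase order_items > order_items.sql

This only works when the child's foreign-key column lines up with the parent's sampled key; deeper hierarchies need the same trick applied level by level.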

Also, the condition is applied to every table in the dump. You might have exceptions, such as small lookup tables you want dumped in full, so you may have to run mysqldump several times with different lists of tables and different --where conditions, as in the sketch below.
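For example, with hypothetical table names, the separate runs might look like this:

    # small lookup tables dumped in full
    mysqldump mydatabase countries currencies plans > lookups.sql
    # large transactional tables sampled at roughly 1%
    mysqldump --where "rand() < 0.01" mydatabase events audit_log page_views > sampled.sql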

Also, no anonymization is applied; any sensitive data is dumped in clear text.
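If you went this route, you would have to scrub the data yourself after restoring the sample to a test instance, for example with UPDATE statements against whatever sensitive columns exist (the table and columns below are hypothetical):

    -- overwrite PII in a hypothetical customers table after loading the sample
    UPDATE customers
    SET email     = CONCAT('user', id, '@example.com'),
        full_name = CONCAT('Customer ', id),
        phone     = NULL;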

Ultimately I agree with tadman's comment above: use a tool to generate synthetic data. Then you can generate as much as you want per table, whenever you want, and you are guaranteed that none of it is sensitive data.

Re your comment above:

You can match the customer's schema and tables by dumping with `mysqldump --no-data ...` and restoring to a test instance. That part can be done in a minute.
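A minimal sketch, assuming a source schema called customerdb and a test server reachable as testhost:

    # dump table structure only (no rows), plus stored routines and triggers
    mysqldump --no-data --routines --triggers customerdb > schema.sql
    # load it into an empty schema on the test instance
    mysql -h testhost -u testuser -p customerdb_test < schema.sql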

Then fill the tables with synthetic data. To your point, yes, it probably does take days to do this for a given schema if you need the data to match the general format of the customer's values, data frequency, and so on. It depends on how accurate you need the test to be.
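As a rough sketch of the idea (MySQL 8.0+; the orders table and its columns are hypothetical, and the generated values only mimic plausible shapes), a recursive CTE can bulk-generate fake rows directly in SQL:

    -- raise the recursion limit (the default is 1000 iterations)
    SET SESSION cte_max_recursion_depth = 1000000;

    -- fill a hypothetical orders table with one million synthetic rows
    INSERT INTO orders (customer_id, order_date, total)
    WITH RECURSIVE seq (n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 1000000
    )
    SELECT FLOOR(1 + RAND() * 10000),                    -- random customer id
           CURDATE() - INTERVAL FLOOR(RAND() * 365) DAY, -- random date in the past year
           ROUND(RAND() * 500, 2)                        -- random order total
    FROM seq;

Dedicated data generators do the same job with much better control over value formats, distributions, and referential integrity.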

I asked a question years ago about synthetic data generation tools: Tools for Generating Mock Data? It was long enough ago that some of the tools mentioned may be out of date, and asking for recommendations of tools or technology is discouraged on Stack Overflow these days. But reading the answers on that linked thread might give you some names to search for.

Bill Karwin