6

Is there a tool (for PostgreSQL, ideally) which can make a small but consistent sample of a big database?

The thing is, we need a testing database, but we don't want to fully copy the production one. First, because it is too big, and second, the nature of testing implies that the testing database will be recreated several times in the process.

Obviously, you cannot simply take random rows from some tables, because that would violate the hell out of foreign keys and whatnot. So I wonder: is there a tool available that can do this?

Maxim Sloyko
  • 15,176
  • 9
  • 43
  • 49
  • You can import the database via cmd line. Make use of a batch file to import the db prior to the testing phase? Have you actually exported and imported the db to see how long it actually takes? It may not be as bad as you think. Since you already have good data, I'd leverage it as much as possible. – Aaron McIver Oct 07 '10 at 15:02
  • What is the problem with a big database? I would actually prefer using a big database for testing because it will also tell you while testing where the performance hazards are. In any case, I do not know of any tool right now that can do this. – mezzie Oct 07 '10 at 15:06
  • @mezzie It is OK when you do normal testing, i.e. when you test the app. But first, you test the _update_process_, and every time it fails, you have to recreate the database again, which, frankly, happens more often than I would like to admit :) – Maxim Sloyko Oct 07 '10 at 15:15
  • I see what you are saying, but you do not need to use pg_dump/pg_restore all the time. You can just zip up the data directory and unzip it if you need to reload (see the sketch below). It will be faster too. – mezzie Oct 07 '10 at 17:02
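
A rough sketch of that data-directory approach, assuming the data directory lives at /var/lib/postgresql/data (the path varies by installation) and that the server is stopped while copying:

```bash
# Stop the server so the data directory is in a consistent state
pg_ctl -D /var/lib/postgresql/data stop

# Take a baseline copy once, before the test runs start
tar czf pgdata-baseline.tar.gz -C /var/lib/postgresql data

# ...run a test that modifies or breaks the database...

# Throw away the broken state, restore the baseline, and restart
rm -rf /var/lib/postgresql/data
tar xzf pgdata-baseline.tar.gz -C /var/lib/postgresql
pg_ctl -D /var/lib/postgresql/data start
```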

7 Answers

1

Yes, I wrote a tool to do this: https://github.com/mla/pg_sample

From the README:

pg_sample is a utility for exporting a small, sample dataset from a larger PostgreSQL database. The output and command-line options closely resemble the pg_dump backup utility (although only the plain-text format is supported).

The sample database produced includes all tables from the original, maintains referential integrity, and supports circular dependencies.
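
A typical invocation might look like the following sketch; the --limit option and the database names here are illustrative, so check the project's README for the exact syntax:

```bash
# Export a sample of the production database; the output is a plain-text
# SQL dump, so it can be loaded with psql.
pg_sample --limit=100 production_db > sample.sql

# Load the sample into a fresh test database
createdb test_db
psql -d test_db -f sample.sql
```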

mla
  • 172
  • 1
  • 7
1

I once built such a tool for the IDMS system.

I was in the process of making it work for SQL systems too when the managers of the company we had been merged into told me I could not continue wasting my time on such futile and unnecessary pieces of software.

To this day, I have neither seen nor heard of any commercially available product that matches what I achieved back then.

Erwin Smout
  • 18,113
  • 4
  • 33
  • 52
1

What about generating mock data with a tool like Databene Benerator, as much as you need, and storing it for reuse?

Community
  • 1
  • 1
Pascal Thivent
  • 562,542
  • 136
  • 1,062
  • 1,124
0

You can create custom scripts to build a meaningful copy of the production data, or use commercial products such as Informatica Data Subset or IBM Optim.

0

You can use pg_dump --schema-only to dump only the schema of the database, then load the dump into a new database (with psql for a plain-text dump, or pg_restore for a custom-format one; see the sketch after the options below). From there you have a few options:

  1. Create your data by hand; this will allow you to cover edge cases but will take a while if you want to test on a lot of data.

  2. Script a few queries to import random sections of each table in your database. As you said, this will violate foreign key constraints; when it does, just ignore the failure. Keep track of the number of successes and keep going until you have as many data items as you want. Depending on your schema, however, this may not work: if you have very restrictive constraints, it might take too long to hit rows that satisfy them.
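
A minimal sketch of the schema-only dump and reload described above. The database names prod_db and test_db are placeholders; the custom format (-Fc) is used so that pg_restore can load the dump:

```bash
# Dump only the schema (no data) from production, in custom format
pg_dump --schema-only -Fc -f schema.dump prod_db

# Create an empty test database and load the schema into it
createdb test_db
pg_restore -d test_db schema.dump
```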

kerkeslager
  • 1,364
  • 4
  • 17
  • 34
0

Back in my Oracle days we would have a test database with a very small auto-generated set of data; at the time it was about a fifth of the production database size. We would then copy the stats from the production database into our test database to make the optimizer think it had billions of rows in its tables when in reality it only had 500,000. This allowed us to get the same explain plans in test that we would get in production. It has its value, but it doesn't answer your whole question, and I'm not sure how easy or even feasible it is to mess with PostgreSQL's stats.

Kuberchaun
  • 29,160
  • 7
  • 51
  • 59
0

You can use IBM Optim to sample a production database. If production data is available in the DB, go for Optim TDM (Test Data Management); if there is no production data, go for IBM Optim TDF (Test Data Fabricator).

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Aug 08 '22 at 15:33