3

I have two data stores (PostgreSQL, MongoDB), and since I need to develop the application locally on my computer (ideally offline), I need data from those stores copied to my HDD.

However, these are massive databases, with hundreds of gigabytes of data.

I don't need all the data stored there, just a sample of it, so I can run my app locally against that data. Both stores have capable tools for data export (pg_dump, mongodump, mongoexport, etc.).

But I don't know how to export a small sample of the data easily and effectively. Even if I took the list of all tables/collections and built a whitelist defining which tables should be limited to a certain number of rows, I would still run into trouble with triggers, functions, indexes, etc.
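
For concreteness, a per-table export limited to some number of rows on the PostgreSQL side could look roughly like the snippet below (the table name and row count are just placeholders), but that still doesn't carry over the triggers, functions or indexes:

    -- rough sketch of a per-table sample export; "some_table" is a placeholder
    COPY (SELECT * FROM some_table LIMIT 1000)
        TO '/tmp/some_table_sample.csv' WITH (FORMAT csv, HEADER);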

Radek Simko

2 Answers

1

I don't know about testing for MongoDB, but for PostgreSQL here's what I do.

I follow a pattern while developing against databases that separates the DB side from the app side. For testing the DB side, I have a test schema which includes a single stored procedure that resets all the data in the real schema. This reset is done following the MERGE pattern (delete any records with an unrecognized key, update records that have matching keys but which are changed, and insert missing records). This reset is called before running every unit test. This gives me simple, clear test coverage for stored functions.
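
A minimal sketch of such a reset, assuming a hypothetical public.customer table and a fixture copy of its rows in test.customer_fixture (the real function would cover every table in the schema), might look like this:

    -- A sketch only: "public.customer" and "test.customer_fixture" are
    -- hypothetical names standing in for a real table and its fixture copy.
    CREATE OR REPLACE FUNCTION test.reset_customer() RETURNS void AS $$
    BEGIN
        -- delete rows whose key is not in the fixture
        DELETE FROM public.customer c
        WHERE NOT EXISTS (
            SELECT 1 FROM test.customer_fixture f WHERE f.id = c.id
        );

        -- update rows that exist in both but have drifted from the fixture
        UPDATE public.customer c
        SET name = f.name, email = f.email
        FROM test.customer_fixture f
        WHERE f.id = c.id
          AND (c.name, c.email) IS DISTINCT FROM (f.name, f.email);

        -- insert fixture rows that are missing from the real table
        INSERT INTO public.customer (id, name, email)
        SELECT f.id, f.name, f.email
        FROM test.customer_fixture f
        WHERE NOT EXISTS (
            SELECT 1 FROM public.customer c WHERE c.id = f.id
        );
    END;
    $$ LANGUAGE plpgsql;

Each unit test's setup then just runs SELECT test.reset_customer(); before the test body.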

For testing code that calls into the database, the database layer is always mocked, so there are never any calls that actually go to the database.

What you are describing suggests to me that you are attempting to mix unit testing with integration testing, and I rather strongly suggest that you don't do that. Integration testing is what happens once you've already proved base functionality and want to prove integration between components, and probably performance as well. For integration testing you really need a representative data set on representative hardware. Usually this means a dedicated machine and using Hudson for CI.

The direction you seem to be going in will be difficult because, as you've already noticed, it's hard to handle that volume of data and hard to generate representative data sets (most CI systems actually use production data that has been "cleaned" of sensitive information).
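
That "cleaning" step is often just a bulk update that overwrites anything sensitive with placeholders; a minimal sketch, assuming a hypothetical users table with these columns, could be:

    -- overwrite personally identifiable data with predictable placeholders
    UPDATE users
    SET email     = 'user' || id || '@example.invalid',
        full_name = 'User ' || id,
        phone     = NULL;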

Which is why most of the places I've worked have not gone that way.

Andrew
0

Just copy it all. Several hundred gigabytes is not very much by today's standards; you can buy a 2000 GB disk for $80.

If you test your code only on a small data sample, how will you know whether your code is efficient enough for the full database?

Just remember to encrypt it with a strong password if it leaves your company building.

Tometzky