
I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products - because if there is a well-known and useful existing tool, I can write my work off to experience. I am aware of a couple of SQL Server specific tools, but mine is not database specific.

So, links? And if you have used such a product, what features did you find it was missing?

Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and supports weighting. It is XML based (sorry, folks) and lets you say things like:

<pick distribute="20,80">
  <datafile file="femalenames.dat"/>
  <datafile file="malenames.dat"/>
</pick>

to select female names about 20% of the time and male names 80% of the time.
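
For readers comparing tools, the weighting idea can be sketched in a few lines of Python. This is not the tool's implementation. The `pick` function, the in-memory name lists standing in for `femalenames.dat` and `malenames.dat`, and the choice to draw uniformly (with replacement) inside the selected source are all assumptions made for illustration.

```python
import random

def pick(sources, weights, rng):
    """Choose one source by weight, then one value uniformly from it."""
    # Assumption: values inside a source are equally likely and are
    # drawn with replacement; the question does not specify this.
    chosen = rng.choices(sources, weights=weights, k=1)[0]
    return rng.choice(chosen)

# Hypothetical stand-ins for the contents of the two data files.
female_names = ["Alice", "Beth", "Carol"]
male_names = ["Dave", "Ed", "Frank"]

rng = random.Random(42)  # fixed seed so runs are repeatable
sample = [pick([female_names, male_names], [20, 80], rng) for _ in range(10_000)]

# Fraction of draws that came from the female list; should be near 0.2.
female_share = sum(name in female_names for name in sample) / len(sample)
print(f"female share: {female_share:.3f}")
```

The weights only need to be proportional, so `[20, 80]` and `[0.2, 0.8]` behave the same way here.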

But the purpose of this question is not to describe my product but to get info on other tools.

Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest

  • @Dirk: I've rolled back your addition of the tag because, while you have provided a reasonable answer in R, Neil didn't ask about a broadly defined task. Tagging the question with a language *reduces* the focus, and that should be the OP's choice. – dmckee --- ex-moderator kitten Aug 22 '09 at 18:30
  • @Neil: does this generator concern itself with a particular data domain, or are we talking about a broadly flexible tool here? Are we talking about generating n-tuples of numbers, or could I fake up a whole SO data dump? – dmckee --- ex-moderator kitten Aug 22 '09 at 18:33
  • @dmkee: Whatever. Adding an R tag would open it to the eyes of the R community which may lead to insightful answers for the _statistical_ nature of the question which you yourself raised as well in your question to Neil. I guess we all agree that the _programming_ aspect of the questions isn't all that demanding. – Dirk Eddelbuettel Aug 22 '09 at 18:38
  • *I guess we all agree that the programming aspect of the questions isn't all that demanding.* That depends entirely on the nature of the data. For instance: are there inter-field constraints? And how complex can they be? – dmckee --- ex-moderator kitten Aug 22 '09 at 18:49
  • I'm certainly happy about adding an R tag, provided none of my original tags are removed to make way for it. –  Aug 22 '09 at 18:52
  • @dmckee Yes, it supports foreign keys and many to many relationships. –  Aug 22 '09 at 18:54
  • @Neil: You get up to five tags, so feel free to add 'r' which still leaves you another one to pick. – Dirk Eddelbuettel Aug 22 '09 at 18:57
  • @Neil, so 0.2 and 0.8 probability to draw from one of the files. How do we then draw from within the files? Is it even weight (p=1/n)? Sampling with or without replacement? – Dirk Eddelbuettel Aug 22 '09 at 18:59
  • @Dirk Like I said, the purpose of the question is to get a list of competing products (of which R is definitely one), rather than describe my own product. But to give a hint, it uses CSV for both input and output. –  Aug 22 '09 at 19:04
  • Check out a bunch of other such tools in answers to my question here: http://stackoverflow.com/questions/591892/tools-for-generating-mock-data – Bill Karwin Aug 22 '09 at 19:35
  • @Bill Thanks for that - very useful. Your general opinion seems to be that they were not too hot? Obviously, I hope you will say "yes". And what is your current opinion? Also, could you make this an answer - lots of people don't read comments. –  Aug 22 '09 at 19:42

2 Answers


That can be a one-liner in R where I use the littler scripting front-end:

# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd@ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),    
                x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv", 
                quote=FALSE, row.names=FALSE)'
edd@ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd@ron:~$

You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.

R can also write to DBs, or read parameters from them, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.

Dirk Eddelbuettel
  • I am aware of R - I've actually answered a couple of questions here on it. The aim of my product is to be much simpler than writing an R program. –  Aug 22 '09 at 18:40
  • So why not take R as a given -- and let your request be reduced to one line of code? But if you don't want that, can you make it clearer why you need to re-invent / re-program subsets of what R already does for you? – Dirk Eddelbuettel Aug 22 '09 at 18:46
  • As I said - to make things simpler. I think it is fair to say that even R users don't find it too easy to use. And see my edit to my question. –  Aug 22 '09 at 18:51

I asked a similar question some months ago:

Tools for Generating Mock Data?

I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).

Features I was looking for were:

  • Generate mock data to fill existing database tables
  • Quick to generate > 1 million rows
  • Produce either SQL script format or flat file suitable for importing
  • Scriptable command-line interface, not a GUI
  • Not dependent on Microsoft Windows environment
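
As a rough illustration of the flat-file and scriptable requirements in the list above, a generator can stream rows straight into a CSV writer so that memory use does not grow with the row count. This is a minimal sketch, not any of the tools discussed here: the column names, the value ranges, and the `generate_rows` and `write_csv` helpers are all hypothetical.

```python
import csv
import io
import random

def generate_rows(n, rng):
    """Yield n mock rows lazily, one tuple per row."""
    names = ["Alice", "Bob", "Carol"]  # hypothetical sample values
    for row_id in range(1, n + 1):
        yield (row_id, rng.choice(names), rng.randint(18, 90))

def write_csv(out, n, seed=0):
    """Write a header plus n generated rows to the file-like object out."""
    rng = random.Random(seed)  # seeded so output is reproducible
    writer = csv.writer(out)
    writer.writerow(["id", "name", "age"])
    writer.writerows(generate_rows(n, rng))

# Demonstration with an in-memory buffer and a small row count.
buffer = io.StringIO()
write_csv(buffer, 1000)
lines = buffer.getvalue().splitlines()
print(len(lines))
```

To produce a file for a bulk loader instead, pass a file opened with `open(path, "w", newline="")` in place of the buffer and raise the row count as needed.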

Nice-to-have features:

  • Extensible/configurable
  • Open-source, free license
  • Written in a dynamic language like Perl/PHP/Python
  • Point it at a database and let it "discover" the metadata
  • Integrated with testing tools (e.g. DbUnit)
  • Option to fill directly into the database as it generates data

The answer I accepted was Databene Benerator. Though since asking the question, I admit I haven't used it very much.

I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.

Bill Karwin
  • Also, it will do all of your prime requirements, with the possible exception of "quick to generate", (because I don't know what you may mean by "quick"). But it won't (currently) do any of your "nice to haves" except for the FOSS requirement. –  Aug 22 '09 at 23:14