2

I have table with schema:

test1 (id integer primary key , padding text)

(there is index on id)

I want to have 10 millions of rows with random padding (padding length smaller than 1024 signs). How to generate it fast and insert it into table?

I'm trying right now this solution:

insert into test1 (select *, random_string(1024) from generate_series(0, 10000000));

where random_string is a function:

create or replace function random_string(length integer) returns text as 
$$
declare
  chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
  result text := '';
  i integer := 0;
  length2 integer := (select trunc(random() * length + 1));
begin
  if length2 < 0 then
    raise exception 'Given length cannot be less than 0';
  end if;
  for i in 1..length2 loop
    result := result || chars[1+random()*(array_length(chars, 1)-1)];
  end loop;
  return result;
end;
$$ language plpgsql;

It seems that postgresql is creating temporary table and only generating of this temporary table will take 2 hours.

user364622
  • 366
  • 1
  • 7
  • 18
  • Do you want every one of these strings to be 1024 characters long? – Mike Sherrill 'Cat Recall' Dec 18 '11 at 18:03
  • @Catcall: Evidently not, since `length2` is set to `trunc(random() * length + 1)`. – ruakh Dec 18 '11 at 18:36
  • @ruakh: I read what the code does. I also read "padding length smaller than 1024 signs", which could easily mean "pad strings shorter than 1024 bytes (or chars, not necessarily the same thing) to 1024 bytes (or chars). I always ask questions when the code and the specs don't match. People hate me for that. – Mike Sherrill 'Cat Recall' Dec 18 '11 at 22:22
  • @Catcall: Got it, thanks. I approve of asking that sort of question, and in fact, I do so myself. That said, I think it's best to make clear *why* you're asking the question, since your goal is to call attention to the discrepancy, rather than really to find out "yes" or "no". – ruakh Dec 18 '11 at 22:32

2 Answers2

4

If efficiency is a high concern, then it may make more sense to generate your data using a different language, one that is more carefully optimizable, and then use a COPY statement to copy the data into your table. For example, if you're on a Linux-y system, you can save this C99 program as generate_test1_data.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


#define MIN_ID 0
#define MAX_ID 10000000

#define MAX_LEN 1024

#define CHARS "0123456789" \
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
              "abcdefghijklmnopqrstuvwxyz"


int main()
{
    char const * const chars = CHARS;
    int const num_chars = strlen(chars);

    printf("COPY test1 FROM STDIN;\n");

    for(int i = MIN_ID; i <= MAX_ID; ++i)
    {
        printf("%d\t", i); // can be optimized if needed
        int const len = rand() % MAX_LEN + 1;
        for(int j = 0; j < len; ++j)
            putchar(chars[rand() % num_chars]);
        putchar('\n');
    }

    printf("\\.\n");
    return 0;
}

and then run these commands:

gcc -std=c99 -Wall generate_test1_data.c -o generate_test1_data
./generate_test1_data > populate_test1.sql
psql -U ... -d ... -f populate_test1.sql

On a development box that I have handy at the moment, this whole thing would take well under ten minutes (or at least, I tried it on one-hundredth the data, and that took well under six seconds). That's a pretty powerful box, so on your system it may take longer than that — maybe even much longer — but, I think, not nearly as long as your current approach.

ruakh
  • 175,680
  • 26
  • 273
  • 307
4

The fastest way to do this is to probably

  • create the table with no constraints,
  • use an external program to generate the random data,
  • use COPY to load it into the table,
  • then add the constraints.
Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185