I have a csv file with about 1,500 fields and 5-6 million rows. It is a dataset with one row for each individual who has received public benefits at some point since ISO week 32 in 1991. Each weekly field holds a number identifying the specific benefit received in that week; if the individual received no benefits that week, the field is left blank (''). In addition to the weekly values there are a number of other fields (ID, sex, date of birth, etc.).
The data set is updated quarterly with an added field for each week in the quarter, and an added row for each new individual.
This is a sample of the data:
y_9132,y_9133,y_9134,...,y_1443,id,sex,dateofbirth
891,891,891,...,110,1000456,1,'1978/01/16'
110,112,112,...,997,2000789,0,'1945/09/28'
I'm trying to convert the data to a tabular format so it can be analysed using PostgreSQL with a column store or similar (Amazon Redshift is a possibility).
The fields beginning with "y_" represent the year and ISO week of the received benefit. In the tabular format, each field name should be converted to a sequential week number or a date, starting with the Monday of ISO week 32 in 1991 (1991/08/05).
The tabular dataset I'm trying to convert the csv file to would look like this (week is just a sequential number, starting with 1 for the date '1991/08/05'):
week,benefit,ID
1,891,1000456
2,891,1000456
3,891,1000456
...
1211,110,1000456
1,110,2000789
2,112,2000789
3,112,2000789
...
1211,997,2000789
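If an actual date is needed instead of the sequential week number, it can be derived directly, since week 1 starts on 1991/08/05. A minimal sketch against the BENEFIT table produced further below:

SELECT ID
      ,benefit
      ,DATE '1991-08-05' + (Week - 1) * 7 AS week_start
FROM BENEFIT;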
I have written a function in PostgreSQL that works, but it is very slow: the entire conversion takes about 15 hours. I have tried it on my laptop with an SSD and 8 GB RAM, and on an Amazon RDS instance with 30 GB of memory. Still slow. The PostgreSQL function processes the data in chunks; I've experimented a bit, and 100K rows per batch seems fastest (yeah, 15 hours fast).
To be clear, I'm not particularly looking for a solution using PostgreSQL. Anything will do. In fact, I'm not sure why I would even use a database for this at all.
That said, here are my functions in PostgreSQL:
First function: I load part of the csv file into a table called part_grund. I only load the fields with the weekly data and the ID.
CREATE OR REPLACE FUNCTION DREAMLOAD_PART(OUT result text) AS
$BODY$
BEGIN
    EXECUTE '
        DROP TABLE IF EXISTS part_grund;
        CREATE UNLOGGED TABLE part_grund
            (id int, raw_data text[], rn int[]);
        -- raw_data[1300] is the ID column; raw_data[1:1211] are the 1211 weekly values.
        -- rn is a constant array 1..1211 that is later unnested in parallel with
        -- raw_data to number the weeks.
        INSERT INTO part_grund
        SELECT raw_data[1300]::int
              ,raw_data[1:1211]
              ,rn
        FROM grund_no_headers
        CROSS JOIN
        (
            SELECT ARRAY(
                WITH RECURSIVE t(n) AS
                (
                    VALUES (1)
                    UNION ALL
                    SELECT n + 1 FROM t WHERE n < 1211
                )
                SELECT n FROM t
            ) AS rn
        ) AS rn;
        CREATE INDEX idx_id ON part_grund (id);';
END;
$BODY$
LANGUAGE plpgsql;
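For reference, grund_no_headers simply holds the raw csv rows, with each row stored in a text[] column named raw_data (element 1300 is the ID and elements 1-1211 are the weekly values). The function takes no arguments:

SELECT DREAMLOAD_PART();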
Second function: Here, the data is transformed using the unnest function.
CREATE OR REPLACE FUNCTION DREAMLOAD(startint int, batch_size int, OUT result text) AS
$BODY$
DECLARE
    i integer := startint;                -- lower ID bound of the current batch
    e integer := startint + batch_size;   -- upper ID bound (exclusive) of the current batch
    endint integer;
BEGIN
    endint := (SELECT MAX(ID) FROM part_grund) + batch_size;
    EXECUTE '
        DROP TABLE IF EXISTS BENEFIT;
        CREATE UNLOGGED TABLE BENEFIT (
            ID integer
            ,benefit smallint
            ,Week smallint
        );';
    -- Unpack one batch of IDs at a time; the two unnest calls expand the
    -- value array and the week-number array in parallel, pairing each
    -- weekly value with its week number.
    WHILE e <= endint LOOP
        EXECUTE 'INSERT INTO BENEFIT
                 SELECT ID
                       ,unnest(raw_data) AS benefit
                       ,unnest(rn) AS week
                 FROM part_grund
                 WHERE ID BETWEEN ' || i || ' AND ' || e - 1 || ';';
        i := i + batch_size;
        e := e + batch_size;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
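For completeness, the conversion is then run with the 100K batch size mentioned above; the start value here is just an illustrative ID:

SELECT DREAMLOAD(1000000, 100000);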
As I mentioned above, it works, but it is painfully slow. So, suggestions for a faster way of doing this would be much appreciated.