I have a csv file with about 1,500 fields and 5-6 million rows. It is a dataset with one row for each individual who has received public benefits at some point since ISO week 32 in 1991. Each weekly field holds a number identifying the specific benefit received in that week; if the individual received no benefits that week, the field is left blank (''). In addition to the weekly values there are a number of other fields (ID, sex, date of birth, etc.).
The data set is updated quarterly with an added field for each week in the quarter, and an added row for each new individual.
This is a sample of the data:
y_9132,y_9133,y_9134,...,y_1443,id,sex,dateofbirth
891,891,891,...,110,1000456,1,'1978/01/16'
110,112,112,...,997,2000789,0,'1945/09/28'
I'm trying to convert the data to a tabular format so it can be analysed using PostgreSQL with a column store or similar (Amazon Redshift is a possibility).
The fields beginning with "y_" represent the year and ISO week of the received benefit. In the tabular format, each field name should be converted to a sequential week number or a date, starting with the Monday of ISO week 32 in 1991 (1991/08/05).
The tabular dataset I'm trying to convert the csv file to would look like this (week is just a sequential number, starting with 1 for the date '1991/08/05'):
week,benefit,ID
1,891,1000456
2,891,1000456
3,891,1000456
...
1211,110,1000456
1,110,2000789
2,112,2000789
3,112,2000789
...
1211,997,2000789
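If an actual date is needed instead of the sequential week number, it can be derived directly, since week 1 starts on 1991/08/05. A minimal sketch against the BENEFIT table produced further below:

SELECT ID
      ,benefit
      ,DATE '1991-08-05' + (Week - 1) * 7 AS week_start
FROM BENEFIT;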
I have written a function in PostgreSQL that works, but it is very slow: the entire conversion takes about 15 hours. I have tried it on my laptop with an SSD and 8 GB RAM, and on an Amazon RDS instance with 30 GB of memory. Still slow. The PostgreSQL function processes the data in chunks; I've experimented a bit, and 100K rows per batch seems fastest (yeah, 15 hours fast).
To be clear, I'm not particularly looking for a solution using PostgreSQL. Anything will do. In fact, I'm not sure why I would even use a database for this at all.
That said, here are my functions in PostgreSQL:
First function: I load part of the csv file into a table called part_grund. I only load the fields with the weekly data and the ID.
CREATE OR REPLACE FUNCTION DREAMLOAD_PART(OUT result text) AS
$BODY$
BEGIN
    EXECUTE '
        DROP TABLE IF EXISTS part_grund;
        CREATE UNLOGGED TABLE part_grund
            (id int, raw_data text[], rn int[]);
        -- raw_data[1300] is the ID column; raw_data[1:1211] are the 1211 weekly values.
        -- rn is a constant array 1..1211 that is later unnested in parallel with
        -- raw_data to number the weeks.
        INSERT INTO part_grund
        SELECT raw_data[1300]::int
              ,raw_data[1:1211]
              ,rn
        FROM grund_no_headers
        CROSS JOIN
        (
            SELECT ARRAY(
                WITH RECURSIVE t(n) AS
                (
                    VALUES (1)
                    UNION ALL
                    SELECT n + 1 FROM t WHERE n < 1211
                )
                SELECT n FROM t
            ) AS rn
        ) AS rn;
        CREATE INDEX idx_id ON part_grund (id);';
END;
$BODY$
LANGUAGE plpgsql;
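For reference, grund_no_headers simply holds the raw csv rows, with each row stored in a text[] column named raw_data (element 1300 is the ID and elements 1-1211 are the weekly values). The function takes no arguments:

SELECT DREAMLOAD_PART();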
Second function: Here, the data is transformed using the unnest function.
CREATE OR REPLACE FUNCTION DREAMLOAD(startint int, batch_size int, OUT result text) AS
$BODY$
DECLARE
    i integer := startint;                -- lower ID bound of the current batch
    e integer := startint + batch_size;   -- upper ID bound (exclusive) of the current batch
    endint integer;
BEGIN
    endint := (SELECT MAX(ID) FROM part_grund) + batch_size;
    EXECUTE '
        DROP TABLE IF EXISTS BENEFIT;
        CREATE UNLOGGED TABLE BENEFIT (
            ID integer
            ,benefit smallint
            ,Week smallint
        );';
    -- Unpack one batch of IDs at a time; the two unnest calls expand the
    -- value array and the week-number array in parallel, pairing each
    -- weekly value with its week number.
    WHILE e <= endint LOOP
        EXECUTE 'INSERT INTO BENEFIT
                 SELECT ID
                       ,unnest(raw_data) AS benefit
                       ,unnest(rn) AS week
                 FROM part_grund
                 WHERE ID BETWEEN ' || i || ' AND ' || e - 1 || ';';
        i := i + batch_size;
        e := e + batch_size;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
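For completeness, the conversion is then run with the 100K batch size mentioned above; the start value here is just an illustrative ID:

SELECT DREAMLOAD(1000000, 100000);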
As I mentioned above, it works, but it is painfully slow. So, suggestions for a faster way of doing this would be much appreciated.