I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.
I am struggling to get anything like the performance some others claim (for example, here).
I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader). The machine doing the load is an EC2 m1.xlarge instance running 64-bit Ubuntu 12.04.1, in the same VPC as the Redshift cluster.
I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the PostgreSQL 9.2-1002 JDBC driver (principally because it's the only one in Maven Central, which makes my build easier!).
I've tried all the techniques shown here, except the last one.
I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.
Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:
2013-05-10 15:05:06,937 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
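For reference, this is how the "around 15/second" figure falls out of the timestamps above. It is only a small sketch of the arithmetic: the class name `LoadRate` and the method `rowsPerSecond` are made up for this illustration and are not part of my application, and it assumes each "Done" line closes the batch announced on the line before it.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class LoadRate {
    // Timestamp layout used in the log excerpt (comma before the milliseconds).
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss,SSS";

    /**
     * Rows per second for one batch, given the timestamps of its
     * "Beginning batch of N" line and the matching "Done" line.
     */
    public static double rowsPerSecond(int rows, String begin, String done) {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN, Locale.ROOT);
        // Only the difference matters, so pin a fixed zone to avoid DST surprises.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            long elapsedMs = fmt.parse(done).getTime() - fmt.parse(begin).getTime();
            return rows * 1000.0 / elapsedMs;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp", e);
        }
    }

    public static void main(String[] args) {
        // The three batches from the log excerpt.
        double b1 = rowsPerSecond(170, "2013-05-10 15:05:06,937", "2013-05-10 15:05:18,707");
        double b2 = rowsPerSecond(712, "2013-05-10 15:05:18,708", "2013-05-10 15:06:03,078");
        double b3 = rowsPerSecond(167, "2013-05-10 15:06:03,078", "2013-05-10 15:06:14,381");
        System.out.printf(Locale.ROOT, "%.1f %.1f %.1f rows/s%n", b1, b2, b3);
    }
}
```

The three batches come out at roughly 14.4, 16.0 and 14.8 rows per second, so the rate is about the same whether the batch has 170 rows or 712.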
What am I doing wrong? What other approaches could I take?