
We recently released an open source project to stream data to Redshift in near realtime.

Github: https://github.com/practo/tipoca-stream

The realtime data pipeline streams data from RDS to Redshift:

  • Debezium writes the RDS events to Kafka.
  • We wrote Redshiftsink to sink data from Kafka to Redshift.

We have 1000s of tables streaming to Redshift, loaded with the COPY command. We want to load every ~10 minutes to keep the data as near realtime as possible.
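
For context, a minimal sketch of what one load cycle looks like, assuming psycopg2 and hypothetical table, bucket and role names (this is a sketch, not the exact Redshiftsink implementation):

```python
# Minimal sketch of one micro-batch load: COPY into a staging table,
# then merge (delete + insert) into the target table. Table, manifest
# and IAM role names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="redshift-host", port=5439,
                        dbname="analytics", user="loader", password="...")

def load_batch(table: str, s3_manifest: str, iam_role: str) -> None:
    with conn.cursor() as cur:
        cur.execute(f"CREATE TEMP TABLE {table}_staging (LIKE {table});")
        cur.execute(
            f"COPY {table}_staging FROM %s IAM_ROLE %s MANIFEST GZIP JSON 'auto';",
            (s3_manifest, iam_role),
        )
        # Upsert: drop rows that are about to be replaced, then insert the batch.
        cur.execute(
            f"DELETE FROM {table} USING {table}_staging s WHERE {table}.id = s.id;"
        )
        cur.execute(f"INSERT INTO {table} SELECT * FROM {table}_staging;")
    conn.commit()
```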

Problem: Parallel loads become a bottleneck. Redshift is not good at ingesting data at such short intervals. We do understand Redshift is not a realtime database. What is the best that can be done? Does Redshift plan to solve this in the future?

Workaround that works for us! We have 1000+ tables in Redshift but use no more than 400 on any given day, so we now throttle loads for the unused tables when needed. This keeps the tables that are in use near realtime while leaving Redshift less burdened. This was very useful.
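
A hedged sketch of the throttling idea; the intervals and the source of `last_queried_at` (e.g. Redshift query logs such as stl_query) are assumptions, not the actual Redshiftsink implementation:

```python
# Sketch: skip frequent loads for tables that nobody has queried
# recently, and load them on a slow schedule instead. Thresholds are
# illustrative assumptions.
import time

FAST_INTERVAL = 10 * 60       # near-realtime tables: every ~10 min
SLOW_INTERVAL = 6 * 60 * 60   # throttled tables: every 6 hours
UNUSED_AFTER = 24 * 60 * 60   # "unused" = not queried in a day

def next_load_interval(table: str, last_queried_at: float) -> int:
    if time.time() - last_queried_at > UNUSED_AFTER:
        return SLOW_INTERVAL  # throttle: table is not being read
    return FAST_INTERVAL      # keep hot tables near realtime
```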

Looking for suggestions from the Redshift community!

Alok Kumar Singh
  • More information will be needed to give you useful suggestions. 100s of queries in flight isn't an issue in and of itself, especially if the queries run quickly. As you say, the CPU usage isn't that high, so the bottleneck is elsewhere. What is the disk IO utilization? The network utilization? Which queries are taking the time? How large are the files being COPYed? Best guess based on experience - you have large tables and joining with the staging data is slowing down as they grow. Check your distribution of these tables and try to limit the fact table read - but these are just guesses. – Bill Weiner Apr 27 '21 at 17:26
  • Thanks for the pointers to debug more. I have attached all the dashboards AWS provides. Please take a look and see if it shows some bottleneck. Disk IO utilisation and network utilisation dashboards I could not find. Yes, some queries might take more time due to the large tables and us doing a merge. – Alok Kumar Singh Apr 28 '21 at 07:53
  • Added query monitoring. It looks like the problem is we are stressing Redshift with all the loads at the same time; we need to spread the loads over time. – Alok Kumar Singh Apr 28 '21 at 08:11
  • You really need to look at the Network and Disk IO metrics for Redshift. There are lots of ways that you can misuse Redshift and these metrics can help identify areas for improvement. – Bill Weiner Apr 28 '21 at 14:42
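
For anyone following along, one way to pull those disk and network metrics is CloudWatch's AWS/Redshift namespace; a sketch assuming boto3 credentials are configured and a hypothetical cluster identifier:

```python
# Fetch a Redshift cluster metric (e.g. ReadIOPS, WriteThroughput,
# NetworkReceiveThroughput) at 1-minute resolution from CloudWatch.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch", region_name="ap-south-1")

def redshift_metric(metric: str, cluster: str = "my-cluster"):
    end = datetime.utcnow()
    return cw.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName=metric,
        Dimensions=[{"Name": "ClusterIdentifier", "Value": cluster}],
        StartTime=end - timedelta(hours=3),
        EndTime=end,
        Period=60,                     # 1-minute max/avg, as suggested
        Statistics=["Average", "Maximum"],
    )["Datapoints"]
```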
  • Added pictures for Read Throughput, Read Latency and Read IOPS on disk for both Redshift compute nodes over 3 days. It is a straight line. – Alok Kumar Singh Apr 28 '21 at 16:00
  • Added write IOPS, write Throughput and write Latency as well. – Alok Kumar Singh Apr 28 '21 at 16:07
  • You are going to need to zoom in on the time the issues are seen. You are loading every 5 minutes, so I assume that is how often a spike comes along. Is each 5 min load taking 5 minutes to complete? Then you will always be busy. What is your node type? If you can get 1min max and avg metrics we can compare these to the instance spec. Same thing with the network bandwidth. – Bill Weiner Apr 28 '21 at 16:49
  • Redshift is an analytic powerhouse but data needs to be organized and analyzed to achieve peak performance. The rate at which you are adding data is doable with Redshift but tricky. Your data loads need to be well organized and your UPSERT queries well written. Data distribution and sort order needs to be well thought out. At some point you just cannot push data any faster into Redshift as it is not a real-time database. To get what you have to work better we need to find what limitation is being hit and improve along that axis. – Bill Weiner Apr 28 '21 at 16:55
  • Thanks a lot for the help in showing how to debug. I will dig into it more to find the bottleneck. – Alok Kumar Singh Apr 28 '21 at 18:39
  • Confirmed we are hitting the limits in Redshift. When we stop all the loads and load only one table, it loads in 2m; otherwise it takes 10m. – Alok Kumar Singh Apr 29 '21 at 08:53
  • So if you can improve that one load by addressing the aspect that is causing the bottleneck you likely will greatly improve the performance for all loads when running concurrently. So what does this load process look like? (queries, table DDLs, data sizes etc.) – Bill Weiner Apr 29 '21 at 13:52
  • Figured out the cause after we added metrics for every task the loader does. We now have duration and load speed metrics. The files we were loading in the incremental/realtime sink were big. Loading 50-60 files of 200kb per load has increased the speed from 70kb/second to 1Mb/second! We have a 2 node ra3.xlplus cluster in Mumbai AWS. What is the maximum load speed we can get @BillWeiner? Redshift at our scale has no scaling issues, that is for sure now! – Alok Kumar Singh May 06 '21 at 13:33
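
A sketch of the kind of per-task instrumentation described above, using prometheus_client; the metric names and wiring are hypothetical:

```python
# Record a duration histogram and a bytes/second gauge per table so
# load speed regressions show up in Prometheus.
import time
from prometheus_client import Histogram, Gauge, start_http_server

LOAD_SECONDS = Histogram("rsk_load_seconds", "COPY duration", ["table"])
LOAD_SPEED = Gauge("rsk_load_bytes_per_sec", "COPY throughput", ["table"])

def timed_load(table: str, nbytes: int, do_load) -> None:
    start = time.monotonic()
    do_load()                                  # runs the actual COPY
    elapsed = time.monotonic() - start
    LOAD_SECONDS.labels(table=table).observe(elapsed)
    LOAD_SPEED.labels(table=table).set(nbytes / max(elapsed, 1e-9))

start_http_server(9100)  # expose /metrics for Prometheus to scrape
```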
  • Good to hear and good work tracking down this network bandwidth issue. This is a common issue (S3 bandwidth on COPY) and I can give you some general advice. I believe that your cluster has 4 slices by default, so 4 independent connections to S3 can be initiated in parallel. This is a big part of the performance speed up you are seeing. Each slice can load some of the S3 files and things run in parallel. Now it looks to me like your numbers are a bit off - 60 X 200kB = 12MB, not really that big. Can you confirm it shouldn't be 200MB files? – Bill Weiner May 06 '21 at 15:34
  • Now it is critical to compress the files in S3. Moving uncompressed data over a network that is your limiter is not the best approach. If your files are not compressed in S3 you can likely improve bandwidth by 2X or more. – Bill Weiner May 06 '21 at 15:36
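
A minimal sketch of that compression step, assuming boto3 and hypothetical bucket/key names:

```python
# Gzip each batch file before putting it in S3, then tell COPY the
# files are gzipped.
import gzip
import boto3

s3 = boto3.client("s3")

def upload_compressed(rows: bytes, bucket: str, key: str) -> None:
    body = gzip.compress(rows)     # often 2x+ smaller on the wire
    s3.put_object(Bucket=bucket, Key=key + ".gz", Body=body)

# The matching COPY just adds the GZIP option:
#   COPY my_table FROM 's3://bucket/prefix/' IAM_ROLE '...' GZIP JSON 'auto';
```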
  • Lastly, don't go overboard making lots of small S3 objects. It takes S3 on the order of .5sec to look up the object you want and start returning the data, so you want the transfer time of the data to be much larger than this object lookup time. You should be able to get at least 10BASE-T bandwidths between these services, so a 1MB file will take about 1sec to transfer - a 33% overhead, not good. 100MB files will be ~0.5%, good. This is also why I think your units may be off. – Bill Weiner May 06 '21 at 15:45
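
The arithmetic above as a tiny helper, under the stated assumptions of ~0.5s per-object lookup and ~1MB/s transfer:

```python
# Fraction of total time wasted on the per-object S3 lookup: it only
# amortizes once transfer time dominates the 0.5s lookup.
def s3_overhead(file_mb: float, bandwidth_mb_s: float = 1.0,
                lookup_s: float = 0.5) -> float:
    transfer_s = file_mb / bandwidth_mb_s
    return lookup_s / (lookup_s + transfer_s)

print(s3_overhead(1))    # ~0.33 -> 33% overhead for a 1MB file
print(s3_overhead(100))  # ~0.005 -> ~0.5% for a 100MB file
```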
  • To directly answer your question - the maximum bandwidth to S3 from Redshift is dependent on Redshift cluster size. The more slices you have, the more parallelism there will be in the transfer. Effective bandwidth will also depend on how well your data files compress, but compressing is always a good idea. – Bill Weiner May 06 '21 at 15:49
  • Hey, thanks for the inputs again. It was not big files actually. We were batching every 60 seconds, so it must have been a lot of smaller files. Now we batch at 200kb and load 50-60 such files in one load. Total slices per node is 2, total nodes are 2. Also we have enabled gzip compression. Was the perf gain due to the consistent 200kb files now, i.e. fewer files to download? What could be the max for this cluster? Is 1mb/s good? – Alok Kumar Singh May 06 '21 at 17:03
  • We are doing a sink of 1000s of tables and want to keep the lag under 15 mins. Many tables get very few events in 15 mins; the example I shared above had 200kb batch files and we are loading 10mb (50-60 of such files in one load). – Alok Kumar Singh May 06 '21 at 17:31
  • I haven't done the experiment of measuring S3-to-Redshift bandwidth in a while, but when I did it generally looked like a 10mb/sec network. It could be higher now. You should be able to see 40mb/sec, or 5mBytes/sec, if you can maximize things. However your file sizes are working against you at this point - 200kB is too small, due to the .5sec access overhead per file I mentioned. Try reducing the number of files down to say 12 (from 60). This should improve performance while ensuring that all slices stay busy. You won't get the max, but I bet your BW will double. – Bill Weiner May 06 '21 at 20:04
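
A sketch of that coalescing suggestion: group the ~60 small files into ~12 larger objects (a small multiple of the 4 slices) before loading; the grouping policy here is an assumption:

```python
# Group S3 keys into `target_files` buckets; each group would then be
# downloaded, concatenated and re-uploaded as one larger object (or
# re-batched upstream) before COPY.
def coalesce(keys: list[str], target_files: int = 12) -> list[list[str]]:
    groups = [[] for _ in range(target_files)]
    for i, key in enumerate(keys):
        groups[i % target_files].append(key)
    return groups
```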
  • What I learnt the hard way: I got happy seeing 2Mb/second for one table, only to realise later that the total download speed for a cluster is limited. If you are downloading many tables together, the bandwidth is shared. So you would see 1 table at 2mb/second but many others in kb/second at the same time. Loading only one table at a time is fine, but loading many many tables together is not. I have a 30 minute window and thousands of tables. How do I fit them in for max performance... – Alok Kumar Singh May 07 '21 at 19:12
  • Or this could also be due to queuing in Redshift making the speed drop from 2Mb/sec to 300kb/sec. How do I solve this without making the cluster bigger? – Alok Kumar Singh May 07 '21 at 19:22
  • Check this https://drive.google.com/file/d/1cWnckCo6Q19yXKTrAmFpMrIV1cgXOcG5/view?usp=sharing. It is a point-in-time download speed of all the loads happening at that time, scraped by Prometheus every 15 seconds. Only one is at 2mb/sec; the rest are all in kb/second. I did see some queuing in Redshift at the same time. – Alok Kumar Singh May 07 '21 at 19:38
  • I think there may be a combination of things happening. First, each slice of your cluster can load 1 file at a time, so yes, your COPYs will have to share the slices. The bandwidth numbers you are stating are low, so I don't think it's the bandwidth of the node itself. More likely it is due to the small size of the files - that .5sec to index an object is time when the network is idle. This makes the realized bandwidth low. Is there any way you can make larger files? Combine several tables' data together if they have the same columns? Just brainstorming. – Bill Weiner May 07 '21 at 21:53
  • Another issue you could be hitting is S3 hashing. Do all these files come from the same bucket? Do they have very similar keys? If your files all hash to the same few nodes in S3 then there can be an S3 hot spot that can impact performance. S3 is a cluster just like Redshift and bucket name and portions of the key (full path key) are hashed to decide where in the cluster the file resides. The S3 accesses can be interfering with each other. – Bill Weiner May 07 '21 at 21:57
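
A sketch of one common mitigation for that hot-spotting: prepend a short hash to the shared key prefix so objects spread across S3 partitions (the layout is hypothetical, and newer S3 auto-partitioning may make this unnecessary):

```python
# Derive a short hash prefix from the table/batch so keys don't all
# share one prefix and land on the same S3 partition.
import hashlib

def spread_key(table: str, batch_id: str) -> str:
    prefix = hashlib.md5(f"{table}/{batch_id}".encode()).hexdigest()[:4]
    return f"{prefix}/{table}/{batch_id}.json.gz"

print(spread_key("payments", "000123"))  # e.g. 'ab12/payments/000123.json.gz'
```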
  • Now for an out of the box thought. Have you looked at Redshift Spectrum? With the right S3 organization the new data would be available as soon as the data is put in S3 (no COPY required). If the performance profile of Spectrum meets your needs you will be good to go. If not, you could load the data to Redshift once per day and use Spectrum for the hourly data since the last daily COPY. Some views would then be used to combine these sets of data. This way only the most recent data is accessed through Spectrum and the bulk of the data is native. – Bill Weiner May 07 '21 at 22:03
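
A sketch of that hybrid layout: a late-binding view unioning the native table with a Spectrum external table for recent data; all schema and table names are hypothetical:

```python
# Bulk history lives in a native table; the freshest data is read
# straight from S3 via a Spectrum external table; a view unions them.
UNIFIED_VIEW = """
CREATE OR REPLACE VIEW analytics.orders_unified AS
SELECT * FROM analytics.orders                -- native, loaded daily
UNION ALL
SELECT * FROM spectrum_schema.orders_recent   -- external, reads S3 directly
WHERE event_ts > (SELECT MAX(event_ts) FROM analytics.orders)
WITH NO SCHEMA BINDING;
"""
```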
  • Thanks @BillWeiner, we distributed the load over time and have managed to keep the lag at ~15 mins for now. – Alok Kumar Singh May 13 '21 at 08:00
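
A sketch of one way to do that distribution: hash each table name to a stable offset inside the load window so thousands of COPYs don't all fire at once (the window length is taken from the lag budget mentioned above):

```python
# Deterministic start offset (seconds) within the load window, so the
# schedule stays stable across loader restarts.
import zlib

WINDOW_S = 15 * 60  # spread loads across the ~15 minute lag budget

def load_offset(table: str) -> int:
    return zlib.crc32(table.encode()) % WINDOW_S
```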
  • Another optimization, to reduce unnecessary loads: https://stackoverflow.com/a/67544874/4106031 – Alok Kumar Singh May 15 '21 at 08:50
  • @BillWeiner hey Bill, we have open-sourced our internal project https://towardsdatascience.com/open-sourcing-tipoca-stream-f261cdcc3a13 and the code is now open for contribution. Please take a look: https://github.com/practo/tipoca-stream – Alok Kumar Singh Jun 19 '21 at 08:39

0 Answers