25

I have a web app that needs to send reports on its usage. I want to use Amazon Redshift as a data warehouse for that purpose. How should I collect the data?

Every time a user interacts with my app, I want to report it. So when should I write the files to S3, and how many? What I mean is:

  • If I don't send the info immediately, I might lose it as a result of a lost connection, or because of some bug in my system while it is being collected and prepared to be sent to S3.
  • If I do write a file to S3 on each user interaction, I will end up with hundreds of files (each containing minimal data) that need to be managed, sorted and deleted after being copied to Redshift. That does not seem like a good solution.

What am I missing? Should I use DynamoDB instead? Should I use a simple INSERT into Redshift instead?
If I do need to write the data to DynamoDB, should I delete the table after it has been copied? What are the best practices?

In any case, what are the best practices to avoid data duplication in Redshift?

Appreciate the help!

Ofer Velich
  • http://stackoverflow.com/questions/38300416/how-to-load-files-from-zip-files-present-in-s3-to-redshift-using-copy-command – AWSDeveloper Jul 15 '16 at 17:19

5 Answers

44

It is preferred to aggregate event logs before ingesting them into Amazon Redshift.

The benefits are:

  • You will use the parallel nature of Redshift better; a COPY of a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERTs or a COPY of a small file (see the COPY sketch after this list).

  • You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This will also improve your load performance and reduce the need to VACUUM your tables.
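
For example, here is a minimal COPY sketch, assuming hypothetical table, bucket and manifest names and using psycopg2 (as one of the answers below does); a single COPY of a manifest of gzipped files lets Redshift load all of them in parallel:

    import psycopg2

    # Connection details are placeholders for your own cluster.
    conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                            port=5439, dbname='analytics', user='loader', password='...')
    cur = conn.cursor()

    # The manifest is a small JSON file in S3 that lists every data file of the batch;
    # Redshift distributes the listed files across its slices and loads them in parallel.
    cur.execute("""
        COPY events
        FROM 's3://my-event-logs/batches/2014-01-13.manifest'
        CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
        MANIFEST
        GZIP
        DELIMITER '|';
    """)
    conn.commit()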

You can accumulate your events in several places before aggregating and loading them into Redshift:

  • Local file to S3 - the most common way is to aggregate your logs on the client/server and upload them to S3 every X MB or every Y minutes. There are many log appenders that support this functionality (for example, FluentD or Log4J), and you don't need to make any modifications to your code; this can be done with configuration only. The downside is that you risk losing some logs if the local log files are deleted before the upload.

  • DynamoDB - as @Swami described, DynamoDB is a very good way to accumulate the events.

  • Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are stored in order of insertion, which makes it easy to load them into Redshift pre-sorted later. The events are kept in Kinesis for 24 hours, and you can schedule the reading from Kinesis and the loading into Redshift every hour, for example, for better performance. A minimal producer sketch follows this list.
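
A minimal producer sketch, assuming the boto3 SDK and a hypothetical stream named `app-events`; it batches a list of events into a single PutRecords call:

    import json
    import boto3  # assumes the newer AWS SDK for Python; the older boto module has a similar Kinesis API

    kinesis = boto3.client('kinesis', region_name='us-east-1')

    def send_events(events, stream_name='app-events'):
        """Send a batch of event dicts to Kinesis in one PutRecords call."""
        records = [{
            'Data': json.dumps(e).encode('utf-8'),
            'PartitionKey': str(e.get('user_id', 'anonymous')),  # spreads records across shards
        } for e in events]
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        # Any throttled/failed records should be retried by the caller.
        return resp['FailedRecordCount']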

Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the availability of your service (how you handle increased load or server failures) and the cost of the system (you only pay for what you use, and you don't need underutilized servers just for logs).

See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490

Another important set of tools that allows direct interaction with these services is the various AWS SDKs, for example for Java, .NET, JavaScript, iOS and Android.

Regarding the de-duplication requirement: in most of the options above you can do that in the aggregation phase. For example, when you are reading from a Kinesis stream, you can check that you don't have duplicate events by analysing a large buffer of events before putting them into the data store.

However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging table and then SELECT INTO a well organized and sorted table, as sketched below.
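
A minimal sketch of one common variant of this staging-table de-duplication, assuming hypothetical `events` and `events_staging` tables that share an `event_id` column:

    import psycopg2

    conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                            port=5439, dbname='analytics', user='loader', password='...')
    cur = conn.cursor()

    # 1. COPY the new batch into the (empty) staging table, as in the earlier COPY sketch.

    # 2. Insert only rows whose event_id is not already present in the target table.
    cur.execute("""
        INSERT INTO events
        SELECT s.*
        FROM events_staging s
        LEFT JOIN events e ON e.event_id = s.event_id
        WHERE e.event_id IS NULL;
    """)

    # 3. Empty the staging table for the next batch.
    cur.execute("TRUNCATE events_staging;")
    conn.commit()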

Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big events table, if the majority of your queries run on a single day (the last day, for example), you can create a set of tables with a similar structure (events_01012014, events_01022014, events_01032014, ...). Then you can SELECT INTO ... WHERE date = ... into each of these tables. When you want to query the data from multiple days, you can use UNION ALL.
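
A small sketch of the daily-table naming and of building the UNION ALL query over several days, assuming hypothetical tables in the format used above:

    from datetime import date, timedelta

    def daily_table(d):
        # Matches the naming convention above, e.g. events_01012014.
        return 'events_%s' % d.strftime('%m%d%Y')

    def union_all_query(start, end):
        """Build a query over all daily tables between start and end (inclusive)."""
        parts, d = [], start
        while d <= end:
            parts.append('SELECT * FROM %s' % daily_table(d))
            d += timedelta(days=1)
        return '\nUNION ALL\n'.join(parts)

    print(union_all_query(date(2014, 1, 1), date(2014, 1, 3)))
    # SELECT * FROM events_01012014
    # UNION ALL
    # SELECT * FROM events_01022014
    # UNION ALL
    # SELECT * FROM events_01032014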

Guy
  • Thanks Guy, your answer is very detailed. I guess I will make use of S3 as the solution for aggregating my data. But one thing I am not sure I understood: why do you claim that I will not have to go through a middle web server? – Ofer Velich Jan 14 '14 at 13:05
  • You can upload objects directly to S3. See here: http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html – Guy Jan 15 '14 at 00:08
  • `and you can schedule the reading from kinesis and loading to Redshift every hour` - any links on how to do this, please? – Kevin Meredith Jan 26 '15 at 18:09
  • 1
  • Kinesis Connectors Library and the relevant properties file: https://github.com/awslabs/amazon-kinesis-connectors/blob/master/src/main/samples/redshiftbasic/RedshiftBasicSample.properties – Guy Jan 28 '15 at 17:16
  • 1
  • Is it possible to use AWS Lambda, rather than the Java connectors library, to copy data from `S3 -> Redshift`? I have not seen any docs on using Lambda to copy data from S3 to Redshift. – Kevin Meredith Jan 30 '15 at 16:52
  • Sure. Whatever you can do with node.js you can do with Lambda. Initiate the COPY command from Lambda after you have connected to the Redshift cluster. You don't get parallel processing and you need to wait for the end of the copy (which might take some time), so it is not an optimal way, but it is possible. – Guy Jan 30 '15 at 17:53
  • Thanks. With respect to the `aws-kinesis-connectors`'s Redshift example, do I simply run the ANT build (after modifying code for my project) locally? If the program stops running, i.e. I hit `CTRL + z`, will the Stream still output to Redshift? – Kevin Meredith Jan 30 '15 at 18:55
  • Kinesis Stream doesn't output events by itself. It is a buffer that stores the events. If the program is not reading the events, they are not getting to Redshift. – Guy Jan 31 '15 at 08:43
  • So, to get the Kinesis stream into Redshift, I can use (1) `aws-kinesis-connectors`, (2) create an AWS Lambda function that, upon a `PutRecord` to Kinesis, runs an `INSERT` into Redshift or (3) ???. From your answers, I believe #2 is poor since it requires too many open connections to Redshift - plus `COPY` > `INSERT` per your answer to this question. #1 seems like a good way to go, but it's not clear to me how to use the project for handling JSON records from Kinesis -> Redshift. – Kevin Meredith Feb 01 '15 at 17:33
  • The Lambda function will get a batch of events and not only a single event, and it can update a manifest file (using a DynamoDB table for example), and COPY only once you have enough files. – Guy Feb 14 '15 at 10:21
  • Do I still need to create partition tables in Redshift? Redshift is supposed to give better performance with large tables, right? – DBS Jul 04 '17 at 07:28
  • @DBS, no need to create multiple tables just for partitioning's sake, but you may consider keeping only the more recently queried data in Redshift (for example, the last 3 months), keeping the rest of the data in S3 and querying it with Spectrum or Athena. Redshift is more like a cache for often-queried data than the only place for your data. – Guy Jul 04 '17 at 14:30
6

One option to consider is to create time series tables in DynamoDB, where you create a table every day or week to record every user interaction. At the end of the time period (day, hour or week), you can copy the logs into Redshift.

For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns

and this blog:

http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html

For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
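
A minimal sketch of such a load, assuming a hypothetical daily DynamoDB table `events_20140113` and a Redshift table `events` whose column names match the DynamoDB attribute names:

    import psycopg2

    conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                            port=5439, dbname='analytics', user='loader', password='...')
    cur = conn.cursor()

    # READRATIO caps the share of the DynamoDB table's provisioned read throughput
    # that the load is allowed to consume.
    cur.execute("""
        COPY events
        FROM 'dynamodb://events_20140113'
        CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
        READRATIO 50;
    """)
    conn.commit()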

Hope this helps.

2

Though there is already an accepted answer here, AWS has since launched a new service called Kinesis Firehose, which handles the aggregation according to user-defined intervals, a temporary upload to S3, the upload (COPY) to Redshift, retries and error handling, throughput management, etc.

This is probably the easiest and most reliable way to do so.
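
For example, a minimal producer sketch, assuming the boto3 SDK and a hypothetical delivery stream named `usage-events` that is already configured with your S3 bucket and Redshift cluster as its destination:

    import json
    import boto3  # assumes the newer AWS SDK for Python

    firehose = boto3.client('firehose', region_name='us-east-1')

    def report_event(event, stream='usage-events'):
        """Send one event; Firehose buffers it, stages it in S3 and COPYs it into Redshift."""
        firehose.put_record(
            DeliveryStreamName=stream,
            Record={'Data': json.dumps(event).encode('utf-8') + b'\n'},  # newline-delimited JSON
        )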

Froyke
2

You can write the data to a CSV file on local disk and then run a Python/boto/psycopg2 script to load the data into Amazon Redshift.

In my CSV_Loader_For_Redshift I do just that:

  1. Compress and upload the data to S3 using the boto Python module and multipart upload.

    import boto
    from boto.s3.key import Key

    # Upload the (compressed) CSV to S3; `progress` is a callback reporting upload progress.
    conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = s3_key_name
    k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                             reduced_redundancy=use_rr)
    
  2. Use psycopg2 to run a COPY command that appends the data to the Redshift table.

    # Build the COPY statement; `quote`, `gzip`, `timeformat` and `ignoreheader`
    # hold optional clauses (QUOTE, GZIP, TIMEFORMAT, IGNOREHEADER ...) or empty strings.
    sql = """
    copy %s from '%s'
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
    DELIMITER '%s'
    FORMAT CSV %s
    %s
    %s
    %s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
              opt.delim, quote, gzip, timeformat, ignoreheader)
    # ...then execute it with a psycopg2 cursor, e.g. cur.execute(sql)
    
Alex B
1

Just being a little selfish here and describing exactly what Snowplow, an event analytics platform, does. They use an awesome, unique way of collecting event logs from the client and aggregating them on S3.

They use CloudFront for this. What you can do is host a pixel in one of your S3 buckets and put that bucket behind a CloudFront distribution as an origin. Then enable access logs to an S3 bucket for the same CloudFront distribution.

You can send event data as URL parameters whenever you call that pixel from your client (similar to Google Analytics). These logs can then be enriched and added to the Redshift database using COPY.
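
A minimal sketch of firing such a pixel from Python, assuming a hypothetical CloudFront domain and pixel path; the interesting part is the query string, which ends up in the CloudFront access logs:

    import json
    import urllib  # Python 2 stdlib; use urllib.parse/urllib.request on Python 3

    PIXEL_URL = 'https://d1234abcd.cloudfront.net/i'  # hypothetical CloudFront distribution + pixel

    def track(event, properties):
        params = urllib.urlencode({'e': event, 'p': json.dumps(properties)})
        # The response is just a 1x1 image; the logged query string carries the event data.
        urllib.urlopen('%s?%s' % (PIXEL_URL, params)).read()

    track('page_view', {'user_id': 42, 'page': '/pricing'})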

This serves the purpose of log aggregation; the setup will handle all of that for you.

You can also look into Piwik, an open-source analytics service, and see if you can modify it to your specific needs.

Sambhav Sharma