
I am working on a data processing application hosted as a web service on an EC2 instance. Each second, a small data file (less than 10 KB) in .csv format is generated.

Problem Statement: Archive all the data files generated to Amazon Glacier.

My Approach: As the data files are very small, I store them in AWS Kinesis and flush the data to S3 after a few hours (because I cannot find a direct way to put data from Kinesis into Glacier). Then, using S3 lifecycle management, I archive all the objects to Glacier at the end of the day.

My Questions:

  1. Is there a way to transfer data to Glacier directly from Kinesis?

  2. Is it possible to configure Kinesis to flush data to S3/Glacier at the end of the day? Is there any time or memory limit on how long Kinesis can hold data?

  3. If Kinesis cannot transfer data to Glacier directly, is there a workaround? For example, can I write a Lambda function that fetches data from Kinesis and archives it to Glacier?

  4. Is it possible to merge all the .csv files at the Kinesis, S3, or Glacier level?

  5. Is Kinesis suitable for my use case? Is there anything else I can use?

I would be grateful if someone could take the time to answer my questions and point me to some references. Please let me know if there is a flaw in my approach or if there is a better way to do this.

Thanks.

Rajat Khandelwal
  • It sounds like this is a very low volume application where [Amazon SQS might be a better/cheaper solution than Kinesis](https://stackoverflow.com/a/49735246/836214): Push/pull from SQS is much simpler, including for hookup via lambda. – Krease Jul 03 '18 at 18:52

1 Answer

  1. You can't put data from Kinesis directly into Glacier (unless you want to put the 10 KB files directly into Glacier).
  2. You could consider Kinesis Data Firehose as a way of flushing data to S3 in 15-minute increments.
  3. You can definitely do that. Glacier allows direct uploads, so there's no need to upload to S3 first.
  4. You could use Firehose to flush to S3, transform and aggregate the data using Athena, and then transition the resulting file to Glacier. Or you could use Lambda and upload straight to Glacier.
  5. Perhaps streaming data into Firehose would make more sense. Depending on your exact needs, IoT Analytics might also be interesting.
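
To make points 3 and 4 a bit more concrete, here is a minimal sketch of a Lambda handler that decodes a batch of Kinesis records, merges the CSV payloads, and uploads the result to Glacier. It is only an illustration under assumptions that are not in the question: every CSV payload is assumed to start with the same header row, the vault name `my-archive-vault` is hypothetical, and the `glacier` parameter exists only so the logic can be tried without AWS credentials.

```python
import base64
import csv
import io


def merge_csv_payloads(payloads):
    """Concatenate CSV texts, keeping only the first header row.

    Assumption: every payload starts with the same header line.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for text in payloads:
        # Skip blank lines so they do not end up in the merged file.
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
        if not rows:
            continue
        if not header_written:
            writer.writerow(rows[0])
            header_written = True
        writer.writerows(rows[1:])
    return out.getvalue()


def handler(event, context, glacier=None, vault_name="my-archive-vault"):
    """Merge the CSV payloads of one Kinesis batch and archive them.

    `vault_name` is a hypothetical placeholder; `glacier` can be any
    object with an `upload_archive` method (a boto3 Glacier client
    by default).
    """
    # Kinesis delivers each record's payload base64-encoded.
    payloads = [
        base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        for record in event.get("Records", [])
    ]
    merged = merge_csv_payloads(payloads)
    if not merged:
        return None
    if glacier is None:
        import boto3  # imported lazily so the merge logic runs offline

        glacier = boto3.client("glacier")
    response = glacier.upload_archive(
        vaultName=vault_name,
        archiveDescription="merged csv batch",
        body=merged.encode("utf-8"),
    )
    return response["archiveId"]
```

Note that a handler like this only sees the records of a single invocation, so it produces one archive per batch rather than one per day. Getting a single daily file would still need an intermediate store such as S3.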

Reading your question again and seeing that you use CSV files, I would highly recommend the Kinesis > S3 > Athena > transition to Glacier approach.
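
For the last step of that pipeline, the transition to Glacier can be expressed as an S3 lifecycle rule. The sketch below builds such a rule and applies it with boto3's `put_bucket_lifecycle_configuration`. The rule ID, the prefix, and the one-day delay are assumptions chosen for illustration, and the `s3` parameter is only there so the rule can be inspected without calling AWS.

```python
def glacier_transition_rule(prefix, days=1):
    """Build a lifecycle configuration that moves objects under
    `prefix` to the GLACIER storage class after `days` days.

    The rule ID and the default of one day are illustrative choices.
    """
    return {
        "Rules": [
            {
                "ID": "archive-merged-csv",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_rule(bucket, prefix, s3=None):
    """Attach the rule to `bucket` (replaces any existing lifecycle
    configuration on that bucket)."""
    if s3 is None:
        import boto3  # imported lazily so the rule can be built offline

        s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=glacier_transition_rule(prefix),
    )
```

Something like `apply_rule("my-bucket", "merged/")` would then archive whatever Athena (or any other merge step) writes under `merged/`; both names are placeholders.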

Exelian
  • Thanks for your response. I did not understand your answer to question 2. As per the AWS docs (https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html#buffer), data is transferred to the destination (S3) when one of the buffer conditions is satisfied. I want to transfer data at the end of the day. Is that possible using Kinesis Data Firehose? Did I get something wrong? – Rajat Khandelwal Jun 30 '18 at 06:32
  • If you look at the docs you'll see it's either 128 MB or 15 min, whichever happens first. So Firehose will buffer at most 15 minutes of data at once. – Exelian Jun 30 '18 at 09:09
  • Thanks for the follow-up, @exelian. So we cannot do it at the end of the day; it's 128 MB or 15 min, whichever is earlier. My other doubt is that I do not need to stream data in real time. Do you think there is a better alternative to Kinesis? The data I get from the underlying application is very small, and the application will run for only a few minutes every hour, so I think flushing data every 15 minutes will be a performance hit. Please let me know your thoughts on this. Is there a way to implement "end of the day data flushing functionality"? – Rajat Khandelwal Jun 30 '18 at 10:30
  • And do you think Athena will play an important role here? We do not want to query the .csv files; we just want to merge all of them at either the queue or the S3 level. Can we implement this using some Lambdas or something? – Rajat Khandelwal Jun 30 '18 at 10:35
  • You could use Athena to simply perform one query that returns all relevant data. It will output one file to S3, which makes it easy to handle. I think you'll just have to try both options; I know that setting up Firehose and Athena will cost you 3 hours at most. – Exelian Jun 30 '18 at 10:39
  • I do not need to stream data in real time. Do you think there is a better alternative to Kinesis? The data I get from the underlying application is very small, and the application will run for only a few minutes every hour, so I think flushing data every 15 minutes will be a performance hit. Please let me know your thoughts on this. Is there a way to implement "end of the day data flushing functionality"? And can you please point me to some links or documents relevant to this? – Rajat Khandelwal Jun 30 '18 at 10:45
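
As a side note on the "128 MB or 15 min, whichever happens first" rule discussed in the comments, a quick back-of-the-envelope model shows which condition triggers for this workload. This is only a sketch of the rule as described in the thread, not Firehose's actual implementation. The limits are the ones quoted above, and the rate of 10 KB per second is the upper bound from the question.

```python
# Limits as quoted in the comment thread (assumed here, not read from AWS).
BUFFER_LIMIT_BYTES = 128 * 1024 * 1024  # 128 MB
BUFFER_LIMIT_SECONDS = 15 * 60          # 15 minutes


def should_flush(buffered_bytes, seconds_since_flush):
    """Flush as soon as either the size or the time limit is reached."""
    return (
        buffered_bytes >= BUFFER_LIMIT_BYTES
        or seconds_since_flush >= BUFFER_LIMIT_SECONDS
    )


def bytes_after(seconds, bytes_per_second=10 * 1024):
    """Upper bound on buffered data, assuming one 10 KB file per second."""
    return seconds * bytes_per_second


# After 15 minutes at most about 9 MB has accumulated, far below 128 MB,
# so for this workload the time limit is what triggers the flush.
fifteen_minutes = bytes_after(BUFFER_LIMIT_SECONDS)
print(fifteen_minutes, should_flush(fifteen_minutes, BUFFER_LIMIT_SECONDS))
```

In other words, with files this small the size threshold never comes into play, so each flush would hold at most 15 minutes of data and a single end-of-day file would need a separate merge step.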