
I've been doing some research on how to move zipped S3 data to Elasticsearch. On the AWS website there is information about creating a Lambda that unzips the file, re-uploads it, and then moves it to ES. Right now, since I do not have too large a dataset, I am downloading the data onto my local computer and sending it to Elasticsearch in the correct format. Both methods seem inefficient, and I am wondering if there is a way to unzip a file and then move it to Elasticsearch without downloading or re-uploading the data.

Right now this is my code:

import gzip
import re

import boto3

s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(key, 'download_path')

# Regex patterns for the parts of each log line
ip_pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
time_pattern = re.compile(r'\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s\+\d\d\d\d)\]')
message_pattern = re.compile(r'"(.+)"')

with gzip.open('download_path') as files:
    for line in files:
        line = line.decode("utf-8")  # decode bytes to str

        # group(1) returns just the captured value, without the
        # surrounding brackets/quotes that group(0) would include
        ip = ip_pattern.search(line).group(1)
        timestamp = time_pattern.search(line).group(1)
        message = message_pattern.search(line).group(1)

        document = {"ip": ip, "timestamp": timestamp, "message": message}
        # ... each document is then sent to Elasticsearch (omitted here)

If there isn't a better way, I will use the code above.

haneulkim
  • Where's the Elasticsearch instance hosted - on AWS too? I assume the Lambda would work without downloading and re-uploading, wouldn't it, since it will run within the same AWS region? Why do you think that is inefficient? Or you could create a temporary EC2 instance and run your script there if you don't want to use a Lambda. – Rup Oct 28 '19 at 01:37
  • Yes, ES is hosted on AWS. A Lambda runs when there is an event, so I couldn't find a way to move previously stored files using Lambda, only newly arriving data. 1. I think it is inefficient to re-upload and then move it to ES. 2. Downloading all the files, which I am going to have to delete afterwards, seems inefficient. – haneulkim Oct 28 '19 at 01:45
  • You can trigger lambdas from the AWS console to run now. You have to give them fake data I think, but there's a 'hello world' JSON sample you can just post. OK, but I'm saying if you use a lambda or run your script from an EC2 instance then you will not be downloading data out of the AWS region or reuploading it. – Rup Oct 28 '19 at 01:48

1 Answer


On the AWS website there is information about creating a Lambda that unzips the file, re-uploads it, and then moves it to ES.

You don't need to re-upload the extracted data back to S3 unless it is required for some other purpose,

or

unless unzipping the files and indexing the extracted data into Elasticsearch cannot be completed within the maximum Lambda execution time.

I believe the reason for extracting the files and pushing them to a separate S3 bucket could be so that each file can be processed within a single Lambda execution, making it a logical unit of work that fits within the maximum available Lambda execution time.
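As a rough illustration, here is a minimal sketch of how a Lambda could decompress a gzipped log object from S3 and index it straight into Elasticsearch, with no local download and no re-upload. The endpoint, index name, and parsing regexes are assumptions based on the question's code, and authentication against the AWS Elasticsearch domain (e.g. IAM request signing) is omitted for brevity:

import gzip
import re

import boto3
from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

ES_ENDPOINT = "https://your-es-domain.region.es.amazonaws.com"  # hypothetical
ES_INDEX = "access-logs"                                        # hypothetical

ip_pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
time_pattern = re.compile(r'\[(\d+/\w\w\w/\d\d\d\d:\d\d:\d\d:\d\d\s\+\d\d\d\d)\]')
message_pattern = re.compile(r'"(.+)"')

s3 = boto3.client('s3')
es = Elasticsearch(ES_ENDPOINT)  # add auth/signing as required by your domain


def parse(line):
    # Pull out the captured groups, as in the question's code
    return {
        "ip": ip_pattern.search(line).group(1),
        "timestamp": time_pattern.search(line).group(1),
        "message": message_pattern.search(line).group(1),
    }


def handler(event, context):
    # Works for an S3 put trigger; for a manual invocation you could pass
    # the bucket/key in the test event instead.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Stream the object and decompress it in memory -- nothing is
        # written to disk and nothing is re-uploaded to S3.
        body = s3.get_object(Bucket=bucket, Key=key)['Body']
        with gzip.GzipFile(fileobj=body) as lines:
            actions = (
                {"_index": ES_INDEX, "_source": parse(line.decode("utf-8"))}
                for line in lines
            )
            helpers.bulk(es, actions)

Splitting the work so that each invocation handles one file is what keeps it within the Lambda time limit, as described above.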

Unzip zip files from S3

Juned Ahsan
  • I've read the resource on the AWS website; however, it seems to work only with newly arriving data, not previously added data. – haneulkim Oct 28 '19 at 01:46
  • @makewhite you can read an S3 file in Lambda using any SDK. Is there something specific you are referring to? – Juned Ahsan Oct 28 '19 at 01:48
  • I could not find any resource about already existing data, but here is one for streaming data: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-s3-lambda-es – haneulkim Oct 28 '19 at 01:54
  • I have added a link with code to unzip the data from an S3 bucket; you can take the code from the Amazon link you have mentioned to push the data into Elasticsearch. It requires a bit of coding - I don't have an all-in-one-place solution. – Juned Ahsan Oct 28 '19 at 01:58
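For the already-existing files discussed in the comments above, one option is to skip the event trigger entirely and list the bucket's current objects once, running the same unzip-and-index logic over each of them. This is only a sketch under assumptions: the bucket name and prefix are placeholders, and process_object() is a hypothetical helper that would contain the same get_object / gzip / bulk-index code as the Lambda sketch in the answer.

import boto3

s3 = boto3.client('s3')
BUCKET = "your-log-bucket"  # hypothetical bucket name

# Page through every existing object instead of waiting for new uploads
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix='logs/'):
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.gz'):
            process_object(BUCKET, obj['Key'])  # hypothetical helper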