
I am trying to use the S3 link provided to me, https://ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt, in a PuTTY terminal so that I can create a table in Hive and load the dataset into it.

I tried to download the dataset using this command:

aws s3 cp https://ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt /home/hadoop . 

This command gave me an error, and I tried several other approaches but still failed to get the data.

    *"Gave me error"* is pretty useless error description. Anyway, *"public s3 url"* is simple HTTP URL, so you can use any tool that speaks HTTP, like `wget`. See [How to download a file from HTTP URL?](https://stackoverflow.com/q/19693280/850848) – Martin Prikryl Aug 05 '23 at 12:08
  • You can use SparkSQL to query S3 directly. You shouldn't have to download anything unless you want faster queries – OneCricketeer Aug 05 '23 at 13:06
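
A minimal sketch of the wget approach suggested in the first comment; the /home/hadoop target directory is taken from the question, and the curl alternative is just an assumption about what else may be available on the box:

# download the public object over plain HTTPS
wget -O /home/hadoop/Airlines_data.txt https://ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt

# or, if curl is installed instead of wget
curl -o /home/hadoop/Airlines_data.txt https://ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt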

2 Answers


To use aws s3 cp, you need to have the AWS CLI installed. Once it is installed, you can upload a file with:

aws s3 cp myfilename.txt s3://mybucketname/mypath/myfilename.txt

To download the file, you can use

aws s3 cp s3://mybucketname/mypath/myfilename.txt myfilename.txt

Depending on your AWS setup, you need either access keys or SSO to authenticate. If the machine is an EC2 instance, you can also use an IAM role, which lets you authenticate without SSO or access keys.
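
A minimal sketch of the credential setup, assuming AWS CLI v2 (the profile name is a placeholder; which method applies depends on your account):

# interactive prompt for access key ID, secret access key, default region and output format
aws configure

# or, for SSO-based login (AWS CLI v2 only)
aws configure sso
aws sso login --profile my-sso-profile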

skzi

The URL https://ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is saying:

  • The bucket name is ml-cloud-dataset
  • There is an object called Airlines_data.txt
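
In other words, the same object can be addressed with an S3 URI, which is the form aws s3 cp expects (passing the https:// URL, as in the question, is one reason the command failed):

s3://ml-cloud-dataset/Airlines_data.txt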

Fortunately, it is a publicly accessible bucket, so you can list the contents with the AWS CLI:

$ aws s3 ls ml-cloud-dataset

2020-03-06 23:32:55   10237044 Airlines_data.txt
2020-03-06 23:33:15         84 dept
2020-03-06 23:33:15        218 employee
2020-03-06 23:33:15       1666 hive_key.cer
2020-03-06 23:33:15      22628 u.user
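
To download the object to the local machine instead (which is what the question was attempting), use the s3:// form of the path:

# with configured credentials
aws s3 cp s3://ml-cloud-dataset/Airlines_data.txt /home/hadoop/

# anonymously, since the bucket is public
aws s3 cp s3://ml-cloud-dataset/Airlines_data.txt /home/hadoop/ --no-sign-request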

You can copy the object to your own bucket using:

aws s3 cp s3://ml-cloud-dataset/Airlines_data.txt s3://your-bucket/

To copy ALL the objects, use:

aws s3 sync s3://ml-cloud-dataset/ s3://your-bucket/

However, if you are using Hive within AWS, you possibly don't even need to download the files: you could reference the data directly using s3://ml-cloud-dataset/Airlines_data.txt (see the sketch below).
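
A minimal sketch of what that could look like from the cluster's shell. Since Hive's LOCATION points at a prefix rather than a single object, it assumes the file has been copied under a dedicated prefix in your own bucket (as with the aws s3 cp command above), is comma-delimited, and has a header row; the table name, column names, and types are placeholders to adjust to the real schema:

hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS airlines_data (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/airlines/'
TBLPROPERTIES ('skip.header.line.count'='1');
"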

You could also access it from Amazon Athena using that same path.

John Rotenstein