I have a log file of status_changes, each of which has a driver_id, a timestamp, and a duration. Using the driver_id and timestamp, I want to fetch the matching GPS log from S3. The GPS logs are stored in an S3 bucket under keys of the form bucket_name/yyyy/mm/dd/driver_id.log.
import ast

from mrjob.job import MRJob


class Mileage(MRJob):

    def get_s3_gpslog_path(self, driver_id, occurred_at, status):
        # occurred_at must be a datetime; month and day are zero-padded
        # to match the bucket_name/yyyy/mm/dd/driver_id.log layout.
        s3_path = "s3://gps_logs/{yyyy}/{mm:02d}/{dd:02d}/{driver_id}.log"
        s3_path = s3_path.format(yyyy=occurred_at.year,
                                 mm=occurred_at.month,
                                 dd=occurred_at.day,
                                 driver_id=driver_id)
        return s3_path

    def mapper(self, _, line):
        # Each input line is a Python dict literal.
        line = ast.literal_eval(line)
        driver_id = line['driverId']
        # NOTE: literal_eval yields a str or number here, so it has to be
        # converted to a datetime before get_s3_gpslog_path can use it.
        occurred_at = line['timestamp']
        status = line['status']
        s3_path = self.get_s3_gpslog_path(driver_id, occurred_at, status)
        # ^^ How do I fetch this file and read it?
        distance = calculate_distance_from_gps_log(s3_path, occurred_at, status)
        yield status, distance


if __name__ == '__main__':
    Mileage.run()
From the command line, I run it with the status_change log file as input:

$ python mileage.py status_changes.log
My question is: How do I actually fetch that GPS log, given the S3 URI string I have constructed?
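To make the question concrete, here is a rough sketch of the kind of thing I have in mind, not something I know to be the right way to do it inside an mrjob task. It splits the URI into a bucket and a key and reads the object with boto3's `get_object`. It assumes Python 3, that boto3 is installed on every node that runs the mapper, and that AWS credentials are available there. The names `split_s3_uri` and `read_gps_log` are made up for illustration, and the optional `client` argument exists only so the function can be tried without touching S3.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % (s3_uri,))
    return parsed.netloc, parsed.path.lstrip("/")


def read_gps_log(s3_uri, client=None):
    """Return the lines of the object at s3_uri as a list of strings."""
    if client is None:
        # Assumption: boto3 is installed and credentials are configured.
        import boto3
        client = boto3.client("s3")
    bucket, key = split_s3_uri(s3_uri)
    response = client.get_object(Bucket=bucket, Key=key)
    # The whole object is read into memory, which is only reasonable
    # if a single driver's daily log is small.
    return response["Body"].read().decode("utf-8").splitlines()
```

In the mapper this would be called as `read_gps_log(s3_path)`, and the resulting lines passed on to the distance calculation instead of the path itself.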