
I have a log file of status_changes, each of which has a driver_id, a timestamp, and a duration. Using the driver_id and timestamp, I want to fetch the corresponding GPS log from S3. These GPS logs are stored in an S3 bucket under keys of the form bucket_name/yyyy/mm/dd/driver_id.log.

import ast
from datetime import datetime

from mrjob.job import MRJob


class Mileage(MRJob):

    def get_s3_gpslog_path(self, driver_id, occurred_at, status):
        # Zero-pad month and day so the key matches the yyyy/mm/dd layout
        s3_path = "s3://gps_logs/{yyyy}/{mm:02d}/{dd:02d}/{driver_id}.log"
        s3_path = s3_path.format(yyyy=occurred_at.year,
                                 mm=occurred_at.month,
                                 dd=occurred_at.day,
                                 driver_id=driver_id)
        return s3_path

    def mapper(self, _, line):
        line = ast.literal_eval(line)
        driver_id = line['driverId']
        # Assuming ISO-8601 timestamps; adjust the format to match the data
        occurred_at = datetime.strptime(line['timestamp'],
                                        '%Y-%m-%dT%H:%M:%S')
        status = line['status']
        s3_path = self.get_s3_gpslog_path(driver_id, occurred_at, status)
        # ^^ How do I fetch this file and read it?
        distance = calculate_distance_from_gps_log(s3_path, occurred_at, status)

        yield status, distance


if __name__ == '__main__':
    Mileage.run()

And from the command line I run it with the status_change log file as input: $ python mileage.py status_changes.log

My question is: How do I actually fetch that GPS log, given the S3 URI string I have constructed?


1 Answer


You can use the S3Filesystem class that ships with mrjob, or boto's S3 utilities, from within your script. Either way, you will likely need to hardcode the AWS access key and secret key (or parse them from the Hadoop configuration on each node). Be aware, though, that this mapper may be doing too much: fetching an S3 object for every input record can generate an excessive number of requests. You might be able to rewrite the MapReduce algorithm to do the join less expensively by streaming the GPS logs in alongside the status-change logs.
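Here is a minimal sketch of the S3Filesystem approach. It assumes mrjob exposes the class under mrjob.fs.s3 with a cat() method that streams a file's contents; check the module path and constructor signature against the mrjob version you are running:

from mrjob.fs.s3 import S3Filesystem

# Credentials are assumptions here; hardcode them or read them from the
# Hadoop configuration on each node, as discussed above.
fs = S3Filesystem(aws_access_key_id='...', aws_secret_access_key='...')

# e.g. a path built by get_s3_gpslog_path() in the question
s3_path = 's3://gps_logs/2014/03/05/some_driver_id.log'

# cat() yields the contents of the object at the given S3 URI,
# decompressing it if necessary.
for line in fs.cat(s3_path):
    process(line)  # process() is a hypothetical per-line handler

The equivalent with classic boto looks roughly like this; the gps_logs bucket and key layout mirror the question, and the date in the key is only an example:

import boto

conn = boto.connect_s3(aws_access_key_id='...',
                       aws_secret_access_key='...')
bucket = conn.get_bucket('gps_logs')
key = bucket.get_key('2014/03/05/some_driver_id.log')  # hypothetical key
if key is not None:
    gps_log_contents = key.get_contents_as_string()

And here is one hedged sketch of the join rewrite: feed both status_changes.log and the GPS logs to the job as input, key every record by (driver_id, date), and pair them up in the reducer so that no mapper ever has to call out to S3. parse_record() and distance_covered() are hypothetical helpers standing in for your parsing and distance logic:

from mrjob.job import MRJob

class MileageJoin(MRJob):

    def mapper(self, _, line):
        # parse_record() would tag each line as either a status change
        # or a GPS point and extract driver_id and date from it.
        record = parse_record(line)
        yield (record['driver_id'], record['date']), record

    def reducer(self, key, records):
        records = list(records)
        changes = [r for r in records if r['kind'] == 'status_change']
        gps_points = [r for r in records if r['kind'] == 'gps']
        for change in changes:
            yield change['status'], distance_covered(gps_points, change)

if __name__ == '__main__':
    MileageJoin.run()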

Taro Sato