I'm trying to MapReduce logs, and I'd like to filter all the logs in an S3 bucket by filename before processing them in EMR. Also, some of the files are tar archives, and I'd like mrjob to extract them and then filter the files inside so that only the relevant ones are parsed.
Any idea how to filter an S3 bucket by filename from mrjob? I found the mapper_pre_filter method, but it only filters the input line by line.
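
To make the goal concrete, here is a rough sketch in plain Python of the kind of filtering I have in mind. It does not use mrjob or S3 at all: filter_keys and read_matching_members are hypothetical helper names, and the glob pattern and key names are made up for illustration.

```python
import fnmatch
import io
import tarfile


def filter_keys(keys, pattern):
    """Keep only the keys whose base filename matches a glob pattern.

    `keys` is assumed to be a list of S3-style key strings; listing the
    bucket is outside the scope of this sketch.
    """
    return [
        key for key in keys
        if fnmatch.fnmatchcase(key.rsplit("/", 1)[-1], pattern)
    ]


def read_matching_members(fileobj, pattern):
    """Yield (name, contents) for regular files in a tar archive
    whose base filename matches the glob pattern."""
    # "r:*" lets tarfile detect gzip/bz2/xz compression on its own.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            base = member.name.rsplit("/", 1)[-1]
            if fnmatch.fnmatchcase(base, pattern):
                yield member.name, archive.extractfile(member).read()


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    keys = ["logs/access-01.log", "logs/error-01.log", "logs/old.tar"]
    print(filter_keys(keys, "access-*.log"))
```

What I don't know is where logic like this could hook into an mrjob job so that it runs before the mapper sees any lines.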