
EMR Newbie Alert:

We have large logs containing the usage data of our web site. Customers are authenticated and identified by their customer id. Whenever we try to troubleshoot a customer issue, we grep through all the logs (using the customer_id as the search criterion) and pipe the results into a file, then use that file to troubleshoot the issue. We were thinking about using EMR to create per-customer log files ahead of time so we don't have to build them on demand; EMR would do it for us every hour for every customer.

We were looking at EMR streaming and wrote a little Ruby script for the map step. Now we have a large list of key/value pairs (userid, logdata).

We're stuck on the reduce step, however. Ideally I'd want to generate a file with all the logdata of a particular customer and put it into an S3 bucket. Can anybody point us to how we'd do this? Is EMR even the technology we want to use?

Thanks, Benno

3 Answers


One possibility would be to use the identity reducer, setting the number of reduce tasks via a property beforehand. You would arrive at a fixed number of files, in which all the records for a set of users would live. To find a particular user's records, hash the user id the same way the partitioner does to determine which file to look in, and search therein.

If you really want one file per user, your reducer should generate a new file each time it is called (i.e., once per key). I'm pretty sure there are plenty of S3 client libraries available for Ruby.
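For what it's worth, here's a minimal sketch of that reducer as a Hadoop Streaming script in Ruby. It assumes the mapper emitted tab-separated user_id/log-line pairs (Streaming hands them to the reducer sorted by key, so "one file per key" becomes "start a new file whenever the key changes"), and it assumes the aws-sdk-s3 gem for the upload; the region, bucket name, and key prefix are placeholders.

#!/usr/bin/env ruby
# Reducer sketch: input is "user_id<TAB>log_line", sorted by user_id.
# Start a new local file whenever the key changes, then push each
# finished file to S3. Assumes the aws-sdk-s3 gem; the region, bucket,
# and key prefix below are placeholders.
require 'aws-sdk-s3'

S3 = Aws::S3::Client.new(region: 'us-east-1')   # placeholder region
BUCKET = 'my-per-customer-logs'                 # placeholder bucket

def upload(user_id, path)
  File.open(path, 'rb') do |f|
    S3.put_object(bucket: BUCKET, key: "per-user/#{user_id}.log", body: f)
  end
end

current_user = nil
out = nil

STDIN.each_line do |line|
  user_id, log_line = line.chomp.split("\t", 2)
  next if user_id.nil? || log_line.nil?   # skip malformed records
  if user_id != current_user
    if out
      out.close
      upload(current_user, out.path)
    end
    current_user = user_id
    out = File.open("#{user_id}.log", 'w')
  end
  out.puts(log_line)
end

if out
  out.close
  upload(current_user, out.path)
end

Whether you upload from inside the reducer like this or just let the job write its part files and post-process them afterwards is a judgment call; uploading per key keeps the reduce step simple but costs an S3 round trip for every user.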

Judge Mental

Take a look at Splunk. This is an enterprise-grade tool designed for discovering patterns and relationships in large quantities of text data. We use it for monitoring the web and application logs for a large web site. Just let Splunk index everything and use the search engine to drill into the data -- no pre-processing is necessary.

Just ran across this: Getting Started with Splunk as an Engineer

Bob Nadler

Without looking at your code, yes, this is typically pretty easy to do in MapReduce; the best case scenario here is if you have many, many users (who doesn't want that?), and a somewhat limited number of interactions per user.

Abstractly, your input data will probably look something like this:

File 1:
1, 200, "/resource", "{metadata: [1,2,3]}"

File 2:
2, 200, "/resource", "{metadata: [4,5,6]}"
1, 200, "/resource", "{metadata: [7,8,9]}"

Where this is just a log of user, HTTP status, path/resource, and some metadata. Your best bet here is to focus the mapper solely on cleaning the data, transforming it into a format you can consume, and emitting the user id and everything else (quite possibly including the user id again) as a key/value pair.

I'm not extremely familiar with Hadoop Streaming, but according to the documentation, "by default, the prefix of a line up to the first tab character is the key," so this might look something like:

1\t1, 200, "/resource", "{metadata: [7,8,9]}"
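As a rough sketch (assuming the toy comma-separated format above; real logs will need their own parsing), the Streaming mapper in Ruby could be as small as:

#!/usr/bin/env ruby
# Mapper sketch: read raw log lines from STDIN, pull out the user id
# (the first comma-separated field in the toy format above), and emit
# "user_id<TAB>original_line" so the shuffle groups records by user.
STDIN.each_line do |line|
  line = line.chomp
  next if line.empty?

  user_id = line.split(',', 2).first.strip
  next if user_id.empty?   # skip malformed records

  puts "#{user_id}\t#{line}"
end

Keeping the whole original line as the value means the reducer never has to re-parse fields it doesn't care about.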

Note that the 1 is repeated, as you may want to use it in the output, and not just as part of the shuffle. That's where the processing shifts from single mappers handling File 1 and File 2 to something more like:

1:
1, 200, "/resource", "{metadata: [1,2,3]}"
1, 200, "/resource", "{metadata: [7,8,9]}"

2:
2, 200, "/resource", "{metadata: [4,5,6]}"

As you can see, we've already basically done our per-user grep! It's just a matter of doing our final transformations, which may include a sort (since this is essentially time-series data). That's why I said earlier that this is going to work out much better for you if you've got many users and limited user interaction. Sorting (or sending across the network!) tons of MBs per user is not going to be especially fast (though potentially still faster than alternatives).
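If you do need that per-user sort, a simple (memory-hungry) way is to buffer each user's records in the Streaming reducer and sort before emitting; the sketch below assumes the real log lines carry a sortable (e.g. ISO-8601) timestamp as their second comma-separated field, which the toy records above don't show.

#!/usr/bin/env ruby
# Reducer sketch: group the shuffled "user_id<TAB>log_line" records by
# user, sort each group by a timestamp, and emit the result.
# ASSUMPTION: real log lines carry a sortable timestamp as their second
# comma-separated field; the toy records above don't show one.
def flush(user_id, records)
  return if user_id.nil?
  records.sort_by { |r| r.split(',')[1].to_s.strip }   # sort by the assumed timestamp field
         .each { |r| puts "#{user_id}\t#{r}" }
end

current_user = nil
buffer = []

STDIN.each_line do |line|
  user_id, log_line = line.chomp.split("\t", 2)
  next if user_id.nil? || log_line.nil?   # skip malformed records
  if user_id != current_user
    flush(current_user, buffer)
    current_user = user_id
    buffer = []
  end
  buffer << log_line
end
flush(current_user, buffer)

Buffering a whole user in memory is exactly where the caveat above about many MBs per user bites; for very chatty users you'd want Hadoop's secondary sort so the records arrive already ordered.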

To sum up, it depends on both the scale and the use case, but typically, yes, this is a problem well suited to map/reduce in general.

Marc Bollinger