Without looking at your code: yes, this is typically pretty easy to do in MapReduce. The best-case scenario here is if you have many, many users (who doesn't want that?) and a somewhat limited number of interactions per user.
Abstractly, your input data will probably look something like this:
File 1:
1, 200, "/resource", "{metadata: [1,2,3]}"
File 2:
2, 200, "/resource", "{metadata: [4,5,6]}"
1, 200, "/resource", "{metadata: [7,8,9]}"
This is just a log of user id, HTTP status, path/resource, and some metadata. Your best bet here is to focus your mapper solely on cleaning the data, transforming it into a format you can consume, and emitting the user id and everything else (quite possibly including the user id again) as a key/value pair.
I'm not extremely familiar with Hadoop Streaming, but according to the documentation, "by default, the prefix of a line up to the first tab character is the key", so the mapper output might look something like:
1\t1, 200, "/resource", "{metadata: [7,8,9]}"
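For concreteness, a minimal Hadoop Streaming mapper along those lines might be a Python script like the sketch below. I'm assuming the comma-separated field layout from the sample above (user id, HTTP status, resource, metadata); adjust the parsing to whatever your logs actually look like.

#!/usr/bin/env python
# mapper.py -- emit "user_id<TAB>cleaned record" for Hadoop Streaming
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Assumed layout: user id, HTTP status, resource, metadata (comma-separated)
    parts = line.split(",", 3)
    if len(parts) != 4:
        continue  # skip malformed records instead of failing the whole job
    user_id = parts[0].strip()
    # Repeat the full record as the value so the user id survives the shuffle
    print("%s\t%s" % (user_id, line))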
Note that the 1 is repeated, as you may want to use it in the output itself, and not just as part of the shuffle. That's where the processing shifts from single mappers handling File 1 and File 2 to something more like:
1:
1, 200, "/resource", "{metadata: [1,2,3]}"
1, 200, "/resource", "{metadata: [7,8,9]}"
2:
2, 200, "/resource", "{metadata: [4,5,6]}"
As you can see, we've already basically done our per-user grep! It's just a matter of doing our final transformations in the reducer, which may include a sort (since this is essentially time-series data). That's why I said earlier that this works out much better for you if you have many users and a limited number of interactions per user: sorting (or sending across the network!) tons of MB per user is not going to be especially fast (though potentially still faster than the alternatives).
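A correspondingly simple reducer might just collect each user's records, as in the sketch below; Streaming sorts reducer input by key, so each user's lines arrive contiguously. The per-user sort is left as a commented-out hypothetical, since the sample data doesn't show a timestamp field.

#!/usr/bin/env python
# reducer.py -- regroup the shuffled lines into per-user blocks
import sys
from itertools import groupby

def keyed_lines(stream):
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split the Streaming key (user id) from the value (original record)
        user_id, _, record = line.partition("\t")
        yield user_id, record

# Reducer input arrives sorted by key, so each user's records are contiguous
for user_id, group in groupby(keyed_lines(sys.stdin), key=lambda kv: kv[0]):
    records = [record for _, record in group]
    # records.sort(key=extract_timestamp)  # hypothetical: only if you log a timestamp
    print("%s:" % user_id)
    for record in records:
        print("  %s" % record)

You would then wire the two scripts together with the hadoop-streaming jar's -input, -output, -mapper, and -reducer options (shipping the scripts with -files); the exact jar path depends on your distribution.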
To sum up, it depends on both the scale and the use case, but typically, yes, this is a problem well suited to map/reduce in general.