I know that if we are going to use Mahout's recommender library, whether it is distributed or not, we have to transform the String ids into long ids first.
If the dataset is not too large, everything is fine: I can easily use an in-memory table or Mahout's IDMigrator to map the String ids to long ids, roughly like the sketch below.
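For the small-dataset case, the kind of mapping I mean looks roughly like this (a minimal sketch assuming Mahout's MemoryIDMigrator; the exact package and method names may differ slightly depending on the Mahout version):

```java
// Minimal sketch: map String ids to long ids in memory with Mahout's
// MemoryIDMigrator (assumed API: toLongID derives a long from the String,
// storeMapping keeps the reverse long -> String mapping for later lookup).
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;

public class InMemoryMapping {
    public static void main(String[] args) throws Exception {
        MemoryIDMigrator migrator = new MemoryIDMigrator();

        String userStringId = "user-12345";            // hypothetical user id
        long userLongId = migrator.toLongID(userStringId);
        migrator.storeMapping(userLongId, userStringId);

        // Later, translate the long id back to the original String id.
        System.out.println(migrator.toStringID(userLongId));
    }
}
```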
However, this String-to-long mapping job becomes a bottleneck once the dataset is large. For example, say I have 10 million users: even if I can do the data preprocessing and model training on EMR very quickly, I still have to walk through all of those user String ids on a single machine in order to generate collision-free long ids for them, which is obviously not feasible if the number of users keeps growing.
Even if I store the mapping in some kind of database, assigning long ids to new users still has to be single-threaded to avoid handing out duplicate ids, so this step remains a bottleneck that cannot be scaled out.
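To make the bottleneck concrete, the sequential assignment I am describing is essentially this (a hypothetical sketch, not actual Mahout code; the class and method names are made up for illustration):

```java
// Hypothetical sketch of the single-threaded assignment problem:
// one shared counter hands out the next long id, so only one thread
// (or one machine) can safely create mappings for new users at a time.
import java.util.HashMap;
import java.util.Map;

public class SequentialIdAssigner {
    private final Map<String, Long> mapping = new HashMap<>();
    private long nextId = 0;

    // Must be synchronized (effectively single-threaded) to avoid
    // assigning the same long id to two different String ids.
    public synchronized long idFor(String stringId) {
        return mapping.computeIfAbsent(stringId, k -> nextId++);
    }
}
```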
So, is there a better approach for doing this String-id-to-long-id mapping?