I know that if we are going to use Mahout's recommender library, whether it is distributed or not, we have to transform the String ids into long ids first.
If the dataset is not too large, everything is fine: I can easily use an in-memory table or Mahout's IDMigrator to map the String ids to long ids, roughly like the sketch below.
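For the small-dataset case, the kind of mapping I mean looks roughly like this (a minimal sketch assuming Mahout's MemoryIDMigrator; the exact package and method names may differ slightly depending on the Mahout version):

```java
// Minimal sketch: map String ids to long ids in memory with Mahout's
// MemoryIDMigrator (assumed API: toLongID derives a long from the String,
// storeMapping keeps the reverse long -> String mapping for later lookup).
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;

public class InMemoryMapping {
    public static void main(String[] args) throws Exception {
        MemoryIDMigrator migrator = new MemoryIDMigrator();

        String userStringId = "user-12345";            // hypothetical user id
        long userLongId = migrator.toLongID(userStringId);
        migrator.storeMapping(userLongId, userStringId);

        // Later, translate the long id back to the original String id.
        System.out.println(migrator.toStringID(userLongId));
    }
}
```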
However, this String-to-long mapping job becomes a bottleneck once the dataset is large. For example, say I have 10 million users: even if I can do the data preprocessing and model training on EMR very quickly, I still have to walk through all of those user String ids on a single machine in order to generate collision-free long ids for them, which is obviously not feasible if the number of users keeps growing.
Even if I store the mapping in some kind of database, assigning long ids to new users still has to be single-threaded to avoid handing out duplicate ids, so this step remains a bottleneck that cannot be scaled out.
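To make the bottleneck concrete, the sequential assignment I am describing is essentially this (a hypothetical sketch, not actual Mahout code; the class and method names are made up for illustration):

```java
// Hypothetical sketch of the single-threaded assignment problem:
// one shared counter hands out the next long id, so only one thread
// (or one machine) can safely create mappings for new users at a time.
import java.util.HashMap;
import java.util.Map;

public class SequentialIdAssigner {
    private final Map<String, Long> mapping = new HashMap<>();
    private long nextId = 0;

    // Must be synchronized (effectively single-threaded) to avoid
    // assigning the same long id to two different String ids.
    public synchronized long idFor(String stringId) {
        return mapping.computeIfAbsent(stringId, k -> nextId++);
    }
}
```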
So, is there a better approach for doing this String-id-to-long-id mapping?