I am currently trying to build a recommender engine for an e-commerce site. I came across this article, which outlines the usage of Dataproc.
I also got Prediction.io running, which seems like a neat project for building such services, although it appears a bit abandoned at the moment.
Now, the problem I have with the solution offered in the Dataproc article is that it doesn't scale. The precomputed recommendations are stored in MySQL, and I am supposed to use some third-party web service to expose them.
This might work for small workloads, but with, say, 100,000 products and 300,000 users, and new users and products continuously coming in, I'll end up bombarding the database with updates just to keep up with all the changes. MySQL doesn't seem like the best fit for this.
I suspect it would be much more robust to deploy the trained model (ALS in this case) behind a web server running on the Spark cluster, query it at request time, and serve the results directly. When a new model is trained, it simply replaces the old one. Something like the sketch below is what I have in mind.
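As a minimal sketch: a long-running Spark driver that loads the latest ALS model (a Spark 1.x MLlib `MatrixFactorizationModel`) and answers HTTP requests from it. The model path, port, and class name are all placeholders I made up, and the HTTP handling is deliberately bare:

```scala
import java.net.InetSocketAddress
import java.util.concurrent.atomic.AtomicReference

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

object RecoServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reco-server"))

    // Hypothetical GCS path where the training job writes each new model.
    val modelPath = "gs://my-bucket/models/als-latest"

    // Keep the current model behind an AtomicReference so a retrain can
    // swap in a fresh model via model.set(...) without restarting the server.
    val model = new AtomicReference(MatrixFactorizationModel.load(sc, modelPath))

    // Plain JDK HTTP server; port 8080 is an arbitrary choice.
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/recommend", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Expects requests like GET /recommend?user=42 (no validation here).
        val userId = exchange.getRequestURI.getQuery.stripPrefix("user=").toInt
        // Note: this runs a small Spark job per request; under real traffic
        // the factor matrices would be collected into memory instead.
        val recos = model.get.recommendProducts(userId, 10)
        val body  = recos.map(r => s"${r.product}:${r.rating}").mkString("\n")
        exchange.sendResponseHeaders(200, body.getBytes.length)
        exchange.getResponseBody.write(body.getBytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```

The `AtomicReference` is only there to illustrate the hot-swap idea: a scheduled retrain would load the new model and `set` it, so in-flight requests keep working against the old one.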
So is it actually possible to do this? Can I run my own long-running applications on the Dataproc cluster? So far I have only been able to schedule jobs through the gcloud CLI (a sketch of how I submit them today is below), and I cannot reach the cluster on the default Spark master port 7077.
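For reference, this is roughly how I submit jobs today (cluster name, class, and jar path are placeholders):

```
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.RecoServer \
    --jars gs://my-bucket/jars/reco-server.jar
```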
Is this within the intended usage of Dataproc, or is it more of a "crunch data and store it somewhere" kind of tool?
Best