I am currently trying to build a recommender engine for an e-commerce site. I came across this article, which outlines the usage of Dataproc.
I also got Prediction.io running, which seems like a neat project for building such services, although it appears a bit abandoned at the moment.
Now, the problem I have with the solution offered in the Dataproc article is that it doesn't scale. The precomputed recommendations are stored in MySQL, and I am supposed to use some third-party web service to expose them.
This might work for small workloads, but with, say, 100,000 products and 300,000 users, and new users and products continuously coming in, I'll end up bombarding the database with updates just to keep up with all the changes. MySQL doesn't seem like the best fit for this.
I suspect it would be much more robust to deploy the trained model (ALS in this case) behind a web server running on the Spark cluster, query it at request time, and serve the results directly. When a new model is trained, it simply replaces the old one. Something like the sketch below is what I have in mind.
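As a minimal sketch: a long-running Spark driver that loads the latest ALS model (a Spark 1.x MLlib `MatrixFactorizationModel`) and answers HTTP requests from it. The model path, port, and class name are all placeholders I made up, and the HTTP handling is deliberately bare:

```scala
import java.net.InetSocketAddress
import java.util.concurrent.atomic.AtomicReference

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

object RecoServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reco-server"))

    // Hypothetical GCS path where the training job writes each new model.
    val modelPath = "gs://my-bucket/models/als-latest"

    // Keep the current model behind an AtomicReference so a retrain can
    // swap in a fresh model via model.set(...) without restarting the server.
    val model = new AtomicReference(MatrixFactorizationModel.load(sc, modelPath))

    // Plain JDK HTTP server; port 8080 is an arbitrary choice.
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/recommend", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Expects requests like GET /recommend?user=42 (no validation here).
        val userId = exchange.getRequestURI.getQuery.stripPrefix("user=").toInt
        // Note: this runs a small Spark job per request; under real traffic
        // the factor matrices would be collected into memory instead.
        val recos = model.get.recommendProducts(userId, 10)
        val body  = recos.map(r => s"${r.product}:${r.rating}").mkString("\n")
        exchange.sendResponseHeaders(200, body.getBytes.length)
        exchange.getResponseBody.write(body.getBytes)
        exchange.close()
      }
    })
    server.start()
  }
}
```

The `AtomicReference` is only there to illustrate the hot-swap idea: a scheduled retrain would load the new model and `set` it, so in-flight requests keep working against the old one.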
So is it actually possible to do this? Can I run my own long-running applications on the Dataproc cluster? So far I have only been able to schedule jobs through the gcloud CLI (a sketch of how I submit them today is below), and I cannot reach the cluster on the default Spark master port 7077.
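For reference, this is roughly how I submit jobs today (cluster name, class, and jar path are placeholders):

```
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.RecoServer \
    --jars gs://my-bucket/jars/reco-server.jar
```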
Is this within the intended usage of Dataproc, or is it more of a "crunch data and store it somewhere" kind of tool?
Best