
Good morning,

Currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.

As we evolve, we'll be developing an intranet to address some of the company's needs, and for some time now I've been considering Scala (and the Play Framework) as the way to go.

This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. This is when I started thinking about Spark ML and came across PredictionIO.

As we are shifting our skills towards data science, which option would benefit and teach us and the company the most:

  • build everything on top of Play and Spark, keeping both the platform and the machine learning in the same project
  • use Play with PredictionIO, where most of the groundwork is already prepared

I'm not trying to ask an opinion-based question; rather, I'd like to learn from your experience / architectures / solutions.

Thank you

Henrique Gonçalves

2 Answers


Both are good options:

  1. Use PredictionIO if you are new to ML: it is easy to start with, but it will limit you in the long run.
  2. Use Spark if you have confidence in your data science and data engineering team: Spark has an excellent, easy-to-use API along with an extensive ML library. That said, putting things into production will require some distributed Spark knowledge and experience, and it can be tricky at times to make it efficient and reliable.

Here are options:

  1. Spark on Databricks Cloud: expensive, but an easy way to use Spark with no data engineering required
  2. PredictionIO: if you are certain that its ML templates can solve all your business cases
  3. Spark on Google Dataproc: an easy managed cluster for about 60% less than AWS, though some engineering is still required

In summary: PredictionIO is a quick fix, while Spark is the long-term data science / engineering investment. You can start with Databricks to minimise the expertise overhead, and move to Dataproc as you go along to minimise costs.
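To make the "extensive ML library" point concrete: the client clustering the question mentions is classic k-means, which Spark exposes as `org.apache.spark.ml.clustering.KMeans` over a DataFrame of feature vectors. Here is a dependency-free Scala sketch of the same algorithm on made-up 2-D points, just to illustrate what the fitted model computes (the data and object name are illustrative, not from either framework):

```scala
// Plain-Scala sketch of k-means (Lloyd's algorithm), the clustering
// technique behind client segmentation. Spark's
// org.apache.spark.ml.clustering.KMeans does the equivalent, distributed
// over a DataFrame. All data below is made up for illustration.
object KMeansSketch {
  type Point = Vector[Double]

  // Euclidean distance between two points of equal dimension
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Component-wise mean of a non-empty set of points
  def mean(ps: Seq[Point]): Point =
    Vector.tabulate(ps.head.length)(i => ps.map(_(i)).sum / ps.size)

  // Assign each point to its nearest centre, recompute centres as
  // cluster means, and repeat for a fixed number of iterations.
  def kmeans(points: Seq[Point], centres: Seq[Point], iters: Int): Seq[Point] =
    if (iters == 0) centres
    else {
      val byCentre = points.groupBy(p => centres.minBy(c => dist(p, c)))
      val next = centres.map(c => byCentre.get(c).map(mean).getOrElse(c))
      kmeans(points, next, iters - 1)
    }

  def main(args: Array[String]): Unit = {
    // Two obvious groups of "clients" around (0, 0) and (10, 10)
    val pts = Seq(Vector(0.0, 0.0), Vector(1.0, 1.0), Vector(0.5, 0.0),
                  Vector(10.0, 10.0), Vector(11.0, 10.5), Vector(10.5, 9.5))
    val centres = kmeans(pts, Seq(pts.head, pts.last), iters = 10)
    centres.sortBy(_.head).foreach(c => println(c.map(x => f"$x%.2f").mkString(", ")))
  }
}
```

The two centres converge to the means of the two groups. The point of Spark is that the same assign/recompute loop runs over partitioned data on a cluster, which is where the distributed-engineering knowledge mentioned above comes in.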

elcomendante
  • Can you explain this line: "PredictionIO will limit you in the long run"? – Abhimanyu Aug 09 '17 at 12:07
  • @Abhimanyu, despite the fact that `PredictionIO` is now based on Spark ML, you will use it mainly for model building through templates, in its "deploy the engine as a service" model. Building the model is only a small part of an ML pipeline: typically you will also need a flexible ETL pipeline for the training data, and furthermore a deployment environment such as streaming or batch. Spark bridges data science and engineering through its friendly syntax, allowing you to handle the whole process in one framework. – elcomendante Aug 09 '17 at 16:06

PredictionIO uses Spark MLlib for the majority of its engine templates.

I'm not sure why you're separating the two?

PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as Deeplearning4j and H2O, to name a few.

Daniel