Questions tagged [apache-spark-mllib]

MLlib is a machine learning library for Apache Spark

MLlib is a low level, RDD based machine learning library for Apache Spark

External links:

Related tags:

,

2241 questions
18
votes
2 answers

Incremental training of ALS model

I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark. My platform is Prediction IO, and it's basically a wrapper for Spark (MLlib), HBase, ElasticSearch and some other Restful parts. In my app…
17
votes
2 answers

KMeans clustering in PySpark

I have a spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude) using them as simple values). I want to extract 7 clusters based on just those 2 columns and then I want to…
17
votes
3 answers

Spark Word2vec vector mathematics

I was looking at the example of Spark site for Word2Vec: val input = sc.textFile("text8").map(line => line.split(" ").toSeq) val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms("country name here",…
user3803714
  • 5,269
  • 10
  • 42
  • 61
17
votes
1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created Term Frequency using HashingTF in Spark. I have got the term frequencies using tf.transform for each word. But the results are showing in this format. [,
Srini
  • 3,334
  • 6
  • 29
  • 64
17
votes
2 answers

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

I'm writing a spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm. For example, there is one LogisticRegression in org.apache.spark.ml.classification also a…
ailzhang
  • 173
  • 1
  • 4
17
votes
5 answers

PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature…
Bryan
  • 5,999
  • 9
  • 29
  • 50
17
votes
1 answer

apache spark MLLib: how to build labeled points for string features?

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents. I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems…
16
votes
3 answers

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a spark column with a new column which is a binary flag. I tried directly overwriting the column id2 but why is it not working like a inplace operation in Pandas? How to do it without using withcolumn() to create new column and…
16
votes
1 answer

How to convert ArrayType to DenseVector in PySpark DataFrame?

I'm getting the following error trying to build a ML Pipeline: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually…
Evan Zamir
  • 8,059
  • 14
  • 56
  • 83
16
votes
4 answers

PySpark computing correlation

I want to use pyspark.mllib.stat.Statistics.corr function to compute correlation between two columns of pyspark.sql.dataframe.DataFrame object. corr function expects to take an rdd of Vectors objects. How do I translate a column of df['some_name']…
VJune
  • 1,195
  • 5
  • 16
  • 26
16
votes
1 answer

Spark ML indexer cannot resolve DataFrame column name with dots?

I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, AnalysisException with the message "cannot resolve 'a.b' given input columns a.b". I'm using Spark 1.6.0. I'm aware that older versions of…
Joshua Taylor
  • 84,998
  • 9
  • 154
  • 353
16
votes
1 answer

Why spark.ml don't implement any of spark.mllib algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs. spark.ml, built on top of Dataframes. According to this and this question on StackOverflow, Dataframes are better (and…
16
votes
1 answer

What is rank in ALS machine Learning Algorithm in Apache Spark Mllib

I Wanted to try an example of ALS machine learning algorithm. And my code works fine, However I do not understand parameter rank used in algorithm. I have following code in java // Build the recommendation model using ALS int rank = 10; …
hard coder
  • 5,449
  • 6
  • 36
  • 61
16
votes
4 answers

What is the right way to save\load models in Spark\PySpark

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation ) from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating data =…
artemdevel
  • 641
  • 1
  • 9
  • 21
15
votes
1 answer

Spark ML VectorAssembler returns strange output

I am experiencing a very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields and I also…
Dimitris
  • 2,030
  • 3
  • 27
  • 45